Article

Generative Artificial Intelligence and Machine Translators in Spanish Translation of Early Vulnerability Cybersecurity Alerts

by Javier Román Martínez 1, David Triana Robles 1, Mouhcine El Oualidi Charchmi 1, Ines Salamanca Estévez 1 and Noemí DeCastro-García 1,2,*

1 Research Institute of Applied Sciences in Cybersecurity (RIASC), Universidad de León, Campus de Vegazana, s/n, 24007 León, Spain
2 Department of Mathematics, Universidad de León, Campus de Vegazana, s/n, 24007 León, Spain
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4090; https://doi.org/10.3390/app15084090
Submission received: 12 March 2025 / Revised: 29 March 2025 / Accepted: 1 April 2025 / Published: 8 April 2025

Abstract:
The increasing reliance on artificial intelligence in cybersecurity has broadened the role of generative artificial intelligence in tasks such as text generation and translation. This study assesses the effectiveness of generative artificial intelligence and conventional translation tools in translating early vulnerability alerts from English to Spanish—a critical process for ensuring the timely dissemination of cybersecurity information. Utilizing a dataset provided by the Spanish National Cybersecurity Institute, translations were generated using various systems and evaluated through linguistic assessment metrics, including methods measuring lexical similarity and others capturing semantic meaning beyond direct word matching. Additionally, word embeddings were employed to enhance the accuracy of semantic similarity analysis. The results indicate that conventional translation tools generally exhibit greater accuracy and structural fidelity, whereas generative artificial intelligence produces more natural-sounding translations. However, this flexibility results in greater variability in translation quality. The findings suggest that while generative artificial intelligence may serve as a valuable complement to traditional tools, its inconsistencies may limit its suitability for highly technical content that demands precision. This study underscores the importance of integrating both approaches to improve the accuracy and accessibility of cybersecurity alerts across different languages.

1. Introduction

With the exponential growth in computational power and the increasing availability of large datasets, artificial intelligence (AI) has advanced toward more sophisticated methodologies, enhancing both the performance and capabilities of the systems that employ it. Within this domain, generative artificial intelligence (GenAI) [1] has emerged as a particularly innovative field. GenAI-driven models have significantly influenced natural language generation, image translation, and interdisciplinary applications, establishing themselves as essential tools across various domains, including automated content creation, productivity enhancement, and process optimization [2,3].
In the realm of cybersecurity, GenAI is gaining increasing relevance by offering novel mechanisms for identifying, mitigating, and responding to cyber threats through the simulation of attack scenarios. However, its deployment also introduces significant challenges due to the potential for misuse, including the orchestration of advanced phishing campaigns, the creation of sophisticated malware, the automated dissemination of disinformation, and identity fraud. Moreover, new attack vectors have emerged that specifically target vulnerabilities within GenAI systems to gain unauthorized access to sensitive information.
This article presents a use case for the application of generative AI in a previously unexplored area of cybersecurity: the translation of early vulnerability alerts into Spanish. According to the National Institute of Standards and Technology (NIST), a vulnerability is defined as a “weakness in a system that could be exploited by a threat” [4]. Vulnerabilities can manifest in hardware, software, or processes and may result from programming errors, misconfigurations, or design flaws. Identifying existing vulnerabilities, applying timely updates, and deploying patches are critical measures to minimize the risk of exploitation. Early warning mechanisms play a crucial role in vulnerability management, allowing organizations to identify and mitigate risks before they can be exploited. However, this proactive approach may be less effective if vulnerability-related information is not accessible in the primary language of a significant portion of the cybersecurity community.
In this context, the Spanish National Cybersecurity Institute (INCIBE) provides a publicly accessible database containing Spanish-language replicas of the latest documented vulnerabilities originally published in English by NIST (see [5]). This repository, comprising over 200,000 records, is maintained through manual translations performed by professional translation teams under a collaboration agreement between INCIBE and NIST. Each vulnerability entry includes technical descriptions, severity assessments, publication and update dates, impact evaluations, and available mitigation strategies. While this resource is invaluable to researchers and cybersecurity professionals, the manual translation process is time-intensive.
The manual translation of vulnerability descriptions entails a substantial daily workload. Given that these descriptions must be processed rapidly and efficiently to serve as effective early warnings, any delay in their translation and dissemination could compromise the security of affected systems. Additionally, the dynamic nature of vulnerabilities necessitates continuous updates, further complicating the translation process. These factors underscore the need for automation in this domain. With this objective in mind, this study aims to identify the most accurate machine translation system for this task. Although GenAI-based models were not originally designed for translation, they have demonstrated capabilities that rival or even surpass those of dedicated translation systems [6]. Integrating GenAI into this process could reduce the manual workload while enhancing translation speed and accuracy, ensuring that critical cybersecurity information reaches users in a timely manner. Consequently, applying GenAI to vulnerability translation is not merely an operational necessity but a strategic approach to strengthening the effectiveness of early warning systems and mitigating security risks in an increasingly digitalized environment. This shift not only addresses an operational bottleneck but also opens new avenues for research in applied natural language processing, real-time multilingual cybersecurity intelligence, and human-AI collaboration in high-stakes environments.
This study examines the incorporation of GenAI and machine translation systems into the Spanish translation process of early vulnerability alerts. Given that the performance of generative AI tools varies depending on the specific use case [7], the primary objective is to determine which system produces the highest quality translations that align with the technical style of existing INCIBE translations. Additionally, the selected model must be both cost-effective and high-performing. To achieve this, an experimental study was conducted using an original dataset provided by INCIBE [8], which is included as a publicly available resource in this article, along with the translations generated by each system. The quality of these translations was assessed using multiple machine translation evaluation metrics that measure different linguistic aspects, including lexical similarity, word order, and edit distance. These results were further analyzed through statistical inference, leading to the development of a style adherence ranking to identify the most suitable model for the early vulnerability alert translation process. Furthermore, word embeddings were employed to assess the semantic quality of both reference translations and machine-generated outputs. By representing words in a multidimensional vector space, this approach enables a more nuanced comparison of semantic similarity, moving beyond traditional lexical matching techniques. Consequently, word embeddings provide a more robust, context-aware method for evaluating translation quality.
The remainder of this article is structured as follows: Section 2 presents the background, Section 3 details the experimental methodology, Section 4 discusses the results, and Section 5 presents the conclusions.

2. Background

This section presents various approaches and relevant tools for translating vulnerabilities. First, Section 2.1 addresses the importance of early vulnerability alerts. Next, Section 2.2 explores different machine translation tools. Finally, Section 2.3 reviews the metrics used to evaluate translation quality.

2.1. Early Warning of Vulnerabilities

A vulnerability is defined as a weakness in an information system, its security procedures, internal controls, or implementation that can be exploited by a threat actor to compromise the system’s confidentiality, integrity, or availability [9]. This formal definition underscores the critical importance of identifying and mitigating such weaknesses to protect essential information assets. Recent reports [10] indicate that organizations that proactively implement security measures can save an average of approximately USD 2.22 million compared to those that do not adopt such preventative actions. This highlights the substantial economic benefits of preventing the exploitation of vulnerabilities before they result in costly data breaches. Through early warning services, identified vulnerabilities can be rapidly disseminated, ensuring that all relevant information is accessible to organizations and end-users with the objective of minimizing potential damage. However, the effectiveness of such prevention and awareness efforts is significantly diminished if the vulnerability descriptions are not available to the entire cybersecurity community. With over 580 million speakers worldwide, Spanish is one of the most widely spoken languages and ranks second in terms of native speakers. Translating early vulnerability alerts into Spanish is therefore crucial for ensuring that Spanish-speaking professionals and users can access critical cybersecurity information in their native language. This facilitates rapid comprehension and response to potential threats, reducing the risk of misinterpretation and enhancing collaborative efforts in implementing security measures. In this context, INCIBE’s translation process plays a vital role in providing Spanish-speaking professionals with timely access to essential security information, thereby lowering language barriers and fostering a swift, coordinated response to emerging threats.
The translation and publication of vulnerability descriptions follow a structured sequence involving multiple stages and stakeholders. The process begins when the National Institute of Standards and Technology (NIST) publishes a new vulnerability in the National Vulnerability Database (NVD), including its original technical description in English under the label Vulnerability Description. INCIBE automatically retrieves the information from the NVD via a feed, making it readily available to its users. At this stage, the INCIBE team manually translates the vulnerability description into Spanish, supplementing it with a descriptive title summarizing the nature of the vulnerability. Once the translation is completed, the content undergoes a review process by professional translators to ensure both technical and linguistic accuracy. Following verification, the translation is submitted to the National Vulnerability Database (NVD). After that, INCIBE retrieves the translated content and publishes it on its website, ensuring that Spanish-speaking users can access vulnerability information with optimal clarity and detail. Each vulnerability report contains the following information in Spanish: unique identifier, severity level, error category, publication and last modification dates, technical description, impact assessment, references, and mitigation solutions. These reports not only identify security issues but also provide essential tools and strategies for understanding and addressing vulnerabilities, thereby supporting the work of researchers and cybersecurity professionals. Additionally, the repository includes the CVE (Common Vulnerabilities and Exposures) identifier, a standardized system that assigns a unique identifier to each known vulnerability, facilitating consistent and efficient referencing across cybersecurity databases [11].
The procedure described above requires continuous updates and entails significant financial and human resource expenditures, making it a strong candidate for automation through machine translation tools. Recent findings presented in [12] indicate that 29.4% of professional translators regularly utilize GenAI and machine translation systems as complementary tools in their work. While AI-generated translations still require human intervention to refine stylistic elements and enhance contextual accuracy [13], these technologies have proven particularly effective for translating scientific texts [13], where fluency and cultural adaptation are less critical factors. In the specific context of translating early vulnerability alerts into Spanish, the primary objective is to faithfully replicate the English source text with a translation that is as technically precise and rigorous as possible. Given this requirement, AI-based translation emerges as a highly suitable solution. Moreover, the integration of GenAI into this process not only enhances translation efficiency but also optimizes workflows, ensures terminological consistency, and facilitates rapid adaptation to evolving vulnerability descriptions. By leveraging AI-driven translation tools, it is possible to streamline the translation process while maintaining the high level of accuracy required for cybersecurity communications.

2.2. Machine Translators

Natural Language Processing (NLP) is a subfield of artificial intelligence that focuses on enabling computers to process and understand human language. It encompasses a wide range of tasks, including text and sentiment analysis, question answering, and machine translation. This study is situated within the latter domain.
According to the European Association for Machine Translation [14], machine translation (MT) is defined as the application of computers to the task of translating texts from one natural language to another. Before the advent of Large Language Models (LLMs), MT was primarily achieved through Rule-Based Machine Translation and Statistical Machine Translation. Rule-based systems relied on predefined linguistic rules and dictionaries to transform text between languages, necessitating extensive linguistic expertise and manual programming [15]. In contrast, statistical systems utilized large parallel corpora to identify translation patterns and estimate probabilistic correspondences between phrases or words in different languages [16]. The subsequent shift to Neural Machine Translation introduced encoder-decoder architectures based on Recurrent Neural Networks, often enhanced with attention mechanisms to improve translation quality [17]. This transition marked the emergence of more advanced techniques, culminating in the development of LLMs based on Transformer architectures [18]. LLMs, such as GPT and BERT [19], are trained on vast amounts of textual data and are capable of understanding, generating, and manipulating natural language. Their ability to capture complex linguistic patterns underpins many advanced NLP applications, including MT. Consequently, two broad categories of translation systems have emerged: dedicated machine translation systems and generative artificial intelligence (GenAI) models, which incorporate translation as one of their multiple capabilities.
Initially, translation systems such as DeepL [20], Google Translate [21], Bing Translator [22], Systran [23], and Reverso [24] employed rule-based models. These systems later evolved through statistical methodologies before transitioning to deep neural networks and Transformer-based architectures, significantly enhancing translation fluency and accuracy. In parallel, GenAI models such as ChatGPT-4o mini [25], Gemini [26], and Copilot [27] have emerged as versatile tools capable of generating content, including translations. While traditional MT systems are optimized for linguistic precision, GenAI models offer greater adaptability across diverse tasks.
Both approaches present distinct advantages depending on the use case. Dedicated MT systems are designed to maximize linguistic accuracy, whereas GenAI models exhibit higher flexibility, allowing them to handle a broader range of language-related tasks. However, GenAI models tend to prioritize naturalness and fluency, sometimes at the expense of strict fidelity to the source text in certain contexts. Notably, among professional translators who incorporate GenAI into their workflows, ChatGPT is the most widely used tool, followed by Microsoft Copilot [12]. Furthermore, according to Google Trends data [28], Google Translate has historically been the most widely used MT system. However, since the launch of ChatGPT in November 2022, its adoption has grown rapidly, surpassing Google Translate as of September 2024.
Another key distinction between machine translation systems and GenAI lies in the use of prompts. Prompts serve as the primary interface through which users interact with GenAI models, typically consisting of a sequence of characters that correspond to natural language text. These inputs are processed by the model, which generates responses based on its underlying architecture and training data. In contrast, traditional machine translation systems directly accept the text to be translated as input without requiring explicit task specification beyond selecting the target language. This difference also leads to variable results, as the outputs of the translations may vary depending on the prompt used.
To systematically compare these tools, a literature review has been conducted, assessing various characteristics as summarized in Table 1. This section categorizes the findings into three main areas: machine translation systems, translation platforms, and GenAI models.
DeepL [20] is a machine translation service developed by DeepLSE. Although it does not support multitasking, it offers advanced functionalities through its API, enabling automation and script-based translation workflows. In terms of input formats, DeepL supports a wide range, with size and character limitations depending on the subscription plan. The translation output varies based on the interface: the DeepL web platform provides direct text-based translations, whereas the API allows results to be obtained in JSON format for text or as translated documents. While DeepL does not permit direct modifications to its translation models, it offers customization features, such as the selection of preferred terminology. The service operates exclusively online via its official platform and is classified as a general-purpose tool due to its ability to translate diverse text types across various domains. Although the update frequency is not explicitly stated, DeepL regularly releases improvements to enhance translation quality and overall functionality.
ChatGPT-3.5 [25], developed by OpenAI, is a versatile language model capable of multitasking, allowing it to process multiple requests simultaneously. It provides an API, though it operates under a paid model, with pricing contingent on usage and the selected model version. Through its API and automated scripting, ChatGPT-3.5 can facilitate translation processes efficiently. The tool supports various input formats, including text, voice, and multiple file types, while generating outputs primarily in text format. As a general-purpose natural language processing model, ChatGPT-3.5 is not explicitly designed for translation but demonstrates high adaptability across linguistic tasks. However, it does not support the direct integration of specialized corpora for domain-specific translations. The tool functions exclusively online and undergoes frequent updates to refine its capabilities and expand its knowledge base.
Google Translate [21], developed by Google, is a widely used machine translation tool. Unlike some AI-based models, it does not support multitasking, meaning it processes translations sequentially. However, it provides an API that requires a billing account on the Google Cloud Platform, enabling automation through scripts for text input and retrieval of translated content. Google Translate supports a diverse range of input formats, including text, voice, images, documents, and web pages. Translations are primarily output as text, though document and website translations are also available. The platform allows certain modifications and customizations, depending on the version used, and employs a combination of linguistic data, human translations, and machine learning techniques to improve translation accuracy. Additionally, Google offers AutoML Translation, which enables the incorporation of custom datasets, allowing for domain-specific adaptation. While Google Translate primarily functions as an online service, it also provides offline capabilities through downloadable language packs on mobile devices.
Copilot [27] is an artificial intelligence tool developed by Microsoft. Unlike other translation tools, Copilot does not support multitasking, meaning it cannot efficiently handle multiple concurrent translation requests. Although it does not provide a public API, it can be integrated with the Microsoft Translator API to automate the translation process, supporting multiple input formats. Translation outputs are primarily generated in text format, and the tool does not allow direct modifications to its translation models or the inclusion of specific corpora. Copilot operates exclusively online and is classified as a generalist tool, capable of translating a broad range of topics and text types. Microsoft updates Copilot approximately once a month, improving its functionality and accuracy and introducing new features. Additionally, Copilot imposes a limit of 30 responses per conversation or topic. A specialized variant, Copilot for Security, is tailored for cybersecurity applications. Unlike the standard version, it is optimized for processing and analyzing information related to cyber threats, vulnerabilities, security mitigations, and technical incident reports. While not specifically designed for translation, it can be leveraged for this purpose in cybersecurity contexts.
Bing Translator [22], also developed by Microsoft, is designed for rapid and accurate machine translation. Notably, it supports multitasking, allowing multiple translations to be processed simultaneously, making it well-suited for high-volume text translation and real-time translation needs. Bing Translator offers a robust API that can be tested for free, enabling integration with scripts for automated text input and retrieval and streamlining translation workflows. The tool supports various input formats, including text, audio, and documents, generating translation outputs in both text and JSON format via the API. Users can fine-tune translations for specific words or expressions, though the exact sources of training data are not publicly disclosed. Bing Translator operates exclusively online and is frequently updated to enhance its accuracy and functionality.
Gemini [26] is an advanced AI-powered translation tool developed by Google. Unlike some traditional machine translation systems, Gemini supports multitasking, allowing it to handle multiple tasks concurrently. The tool provides an API accessible via the Google Cloud Platform, enabling integration with automation scripts for efficient text input and translation retrieval. Gemini accepts diverse input formats, including text and multimedia content such as images and videos. However, its translation outputs are generated exclusively in text format. While Gemini does not allow direct modifications or customizations, it can be supplemented with complementary tools such as Google Cloud AutoML and Google AI Platform to manage specialized corpora for domain-specific translation. Operating solely online, Gemini relies on publicly available and anonymized user data for its translations. The tool is frequently updated to improve functionality and translation accuracy and is classified as a generalist tool capable of handling a wide variety of translation tasks across different subject domains.
E-Translation [29] is a machine translation tool developed by the European Commission. While it does not fully support multitasking, it allows for the simultaneous translation of multiple documents. The tool provides an API [30] that facilitates machine-to-machine communication, enabling automation through scripts for text input and retrieval. In terms of input formats, E-Translation supports a wide range, with a daily limit of 100 documents and a character limit of 2500 for individual text submissions. Translation outputs are generated in text format, and translated files are provided in .docx format. The tool does not allow direct modifications to its translation models or customizable programming versions. E-Translation is trained on various sources, including EU documents, translation memories, and publicly available multilingual texts. While these corpora cannot be modified, public entities may request adjustments in specific cases. The tool receives periodic updates to enhance its functionality and accuracy.
Pangeanic [31] is a translation tool developed by the company of the same name. It does not support multitasking, as it cannot handle multiple translation tasks simultaneously. However, Pangeanic offers an API that enables automation for inputting and retrieving translated texts. The tool supports a wide variety of input formats, generating translation outputs in both text and document formats. Unlike some competitors, Pangeanic does not allow direct modifications to its translation models or provide user-programmable versions. However, it offers specialized dictionaries, allowing for limited customization. The tool utilizes proprietary specialized dictionaries for translation, though the exact sources of its data are not explicitly disclosed. Operating exclusively online, Pangeanic is classified as a generalist translation tool, though it offers the possibility of specialization through domain-specific corpora provided by the company. While the update frequency is not explicitly stated, Pangeanic regularly releases updates to enhance service quality and functionality. Additionally, its API provides a fast and efficient solution for machine translation, allowing users to integrate translation capabilities directly into their content management systems.
Systran [23] is an advanced artificial intelligence tool developed by Chapsvision, designed with multitasking capabilities that allow multiple translation tasks to be processed simultaneously. The platform provides an API with two modes: a free version with a 500,000-character limit and a professional version priced at EUR 8.99 per month with no volume restrictions. Through the Systran API and automation scripts, users can efficiently input text for translation and retrieve results. The tool supports a wide range of input formats and document types, each with specific size limitations depending on the format. Translation results are generated in JSON format when using the API for both text and document translations. Notably, Systran allows customization and modification of its translation models through its Profiles API [32], enabling the integration of specific corpora or dictionaries for domain-specific translations. The tool operates exclusively online and is updated periodically to improve functionality and accuracy.
Reverso [24], developed by Softissimo INC, is an artificial intelligence tool specializing in translation and natural language processing. Unlike general-purpose multitasking tools, Reverso is designed to provide precise translations alongside complementary linguistic tools tailored for individual and professional users. While Reverso does not officially offer a public API, community-developed tools and APIs are available [33], though their usage may be subject to restrictions imposed by the Reverso team. The platform does not support automation for inputting and retrieving translated texts, thereby limiting its integration with advanced automation systems. It accommodates a wide range of input formats, with translation outputs available in plain text and .docx formats. However, Reverso does not permit direct modifications within the platform, nor does it provide programmable versions for user-driven customizations.
ALIA [34] is an artificial intelligence tool developed by the Government of Spain in collaboration with the Barcelona Supercomputing Center. It is distinguished by its specialization in natural language processing for Spanish and the country’s co-official languages, offering open-source models accessible to the research community and for application development. Notable models within the ALIA framework include ALIA-40B, a model comprising 40 billion parameters trained on 6.9 trillion tokens, designed for advanced text generation tasks; Salamandra-7B, an optimized model with 7 billion parameters trained from scratch on 12.875 trillion tokens; and Salamandra-2B, a lightweight model with 2 billion parameters trained on 12.9 trillion tokens. The tool accepts text as its input format and generates output in plain text. Although it does not support the integration of customized corpora for specialized translations, its open-source nature facilitates external modifications and adaptations.
Deep-Translator [35] is an open-source translation tool renowned for its flexibility, accessibility, and unrestricted usage. Designed to streamline translation between multiple languages, it enables the utilization of various translation services through a unified interface, enhancing both versatility and ease of access. Among the supported translation services are Google Translate, Microsoft Translate, Pons Translator, Linguee Translator, MyMemory Translator, Yandex Translator, Qcri Translator, DeepL Translator, LibreTranslate, and ChatGPT.
OpenNMT [36] is an open-source neural machine translation platform that facilitates the training and deployment of neural network-based translation models. Originally developed in Lua and later ported to PyTorch, OpenNMT is recognized for its adaptability to various natural language processing requirements. However, the OpenNMT-py repository [36] is no longer under active development. Instead, the research team has introduced a new project, Eole [37], which extends and enhances the capabilities of OpenNMT-py.
THUMT [38] is an open-source neural machine translation toolkit developed by the Natural Language Processing Group at Tsinghua University. It is designed to facilitate research and development in neural network-based machine translation systems. The toolkit has evolved to support multiple machine learning frameworks, including PyTorch, TensorFlow, and Theano, although the latter is no longer actively maintained.
DeepSeek R1 [39] is a large language model developed by DeepSeek, released in January 2025. It represents an improved iteration of DeepSeek V3, which was trained with 671 billion parameters and 14.8 trillion tokens, utilizing approximately 2.788 million GPU hours. DeepSeek R1 emphasizes enhanced reasoning capabilities achieved through reinforcement learning. While it does not support multitasking, it provides API access for automation and integration into various applications. The model accepts text input and generates outputs in both text and JSON formats. While direct modifications are not permitted, the model supports fine-tuning for specialized tasks. DeepSeek R1 operates exclusively online and is categorized as a generalist AI model, receiving periodic updates to enhance its performance.
Grok-3 [40] is a language model developed by xAI, the artificial intelligence company founded by Elon Musk. Released in February 2025, it was trained using computational resources ten times greater than those employed for its predecessor, Grok-2, leveraging approximately 200,000 GPUs housed within the Colossus data center. Grok-3 demonstrates exceptional proficiency in mathematical and scientific reasoning, surpassing models such as GPT-4o and DeepSeek R1 in multiple benchmark evaluations. It supports multitasking and offers API access for automation. The model processes text input and delivers output in text and JSON formats. While direct modifications are not permitted, some customization options are available. Grok-3 is exclusively an online model, receiving frequent updates aimed at enhancing its capabilities.
Amazon Nova [41] is a family of language models introduced by Amazon in December 2024. It comprises three variants—Nova Micro, Nova Lite, and Nova Pro—each tailored for distinct levels of performance and scalability. While the specific parameter and token counts remain undisclosed, these models are optimized for both commercial and consumer applications. Amazon Nova provides API access, enabling automation and seamless integration with diverse services. The model processes text input and generates outputs in text and JSON formats. While customization options are limited, fine-tuning is available for enterprise users. The models function exclusively online and receive regular updates to enhance efficiency and accuracy.
The tools analyzed are included in Table 2.

2.3. Quality Assessment Indicators

This section presents a literature review of the most commonly utilized metrics for evaluating machine translations. The review was conducted following the methodology outlined in [42], utilizing the Scopus and IEEE Xplore databases. Below, we summarize the most prevalent automatic translation evaluation metrics identified in the scientific literature.
BLEU (Bilingual Evaluation Understudy Score) [43] is among the most widely adopted metrics for assessing machine translations. It is a string-based metric, meaning it evaluates the structural alignment of words between the “hypothesis” translation and the “reference” translation. Specifically, BLEU measures the precision of n-grams between the two translations. This entails comparing contiguous sequences of n words to determine how many match those in the reference translation—beginning with single-word sequences and progressing to multi-word sequences up to the maximum defined n-grams. Furthermore, BLEU incorporates a length penalty to account for discrepancies in word count, penalizing excessively short translations through a “brevity penalty”. The BLEU score for a translation is computed as follows:
$$\mathrm{BLEU} = BP \times \exp\left( \sum_{n=1}^{N} w_n \cdot \log p_n \right)$$
where $w_n$ represents the weight of each n-gram, $p_n$ is the precision of the n-grams, and $BP$ is the brevity penalty, which is defined as
$$BP = \begin{cases} 1 & \text{if } c > r \\ \exp\left( 1 - \frac{r}{c} \right) & \text{if } c \leq r \end{cases}$$
where $c$ is the length of the hypothesis translation and $r$ is the length of the reference translation.
The BLEU score ranges from 0 to 1, with values closer to 1 indicating a higher degree of similarity to the reference translation and, consequently, better translation quality. However, BLEU has certain limitations, including its inability to account for context or semantic fluency, which may hinder its capacity to fully capture human translation quality.
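As an illustration, sentence-level BLEU can be computed with NLTK. The following is a minimal sketch assuming pre-tokenized input and the default uniform 4-gram weights; the example sentences are hypothetical, and smoothing is applied to avoid zero scores on short technical strings:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical tokenized reference (INCIBE-style) and hypothesis translations
reference = ["la", "vulnerabilidad", "permite", "a", "atacantes", "remotos",
             "inyectar", "código", "arbitrario"]
hypothesis = ["la", "vulnerabilidad", "permite", "que", "atacantes", "remotos",
              "inyecten", "código", "arbitrario"]

# Uniform weights over 1- to 4-grams; smoothing prevents a zero score
# when some higher-order n-grams have no match
score = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # value in [0, 1]
```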
GLEU (Google Bilingual Evaluation Understudy Score) [44] is a string-based metric developed by Google that introduces key modifications to address issues related to the excessive use of common n-grams. Unlike BLEU, GLEU evaluates both the correct n-grams present in the translation and the incorrect ones absent in the reference translations, thereby offering a more balanced and appropriate assessment for machine learning-generated translations, particularly those produced by neural network-based machine translation systems.
Similar to BLEU, GLEU relies on n-grams; however, in this case, the score is derived by computing the Precision and Recall of the translations as follows:
$$\mathrm{Precision} = \frac{\text{common } n\text{-grams}}{\text{total } n\text{-grams in the hypothesis}} = \frac{\alpha}{\beta_H}$$
$$\mathrm{Recall} = \frac{\text{common } n\text{-grams}}{\text{total } n\text{-grams in the reference}} = \frac{\alpha}{\beta_R}$$
The final score assigned to the translation is the minimum between precision and recall:
$$\mathrm{GLEU} = \min\left( \frac{\alpha}{\beta_H}, \frac{\alpha}{\beta_R} \right) = \frac{\alpha}{\max(\beta_H, \beta_R)}$$
The result of applying this metric always falls within the range of 0 to 1, with higher values indicating better translation quality. Similar to BLEU, this metric relies exclusively on the string input it receives, meaning it does not assess the semantics or fluency of translations.
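NLTK also provides a sentence-level GLEU implementation. A minimal sketch, reusing the same hypothetical token lists as in the BLEU example:

```python
from nltk.translate.gleu_score import sentence_gleu

reference = ["la", "vulnerabilidad", "permite", "a", "atacantes", "remotos",
             "inyectar", "código", "arbitrario"]
hypothesis = ["la", "vulnerabilidad", "permite", "que", "atacantes", "remotos",
              "inyecten", "código", "arbitrario"]

# GLEU takes min(precision, recall) over 1- to 4-grams by default
score = sentence_gleu([reference], hypothesis, min_len=1, max_len=4)
print(f"GLEU: {score:.3f}")  # value in [0, 1]
```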
CHRF (Character n-gram F-score) [45] is a string-based metric designed to address the limitations of previous evaluation methods. Instead of computing n-grams based on whole words, as BLEU does, CHRF evaluates the match of character n-grams. This characteristic makes CHRF particularly effective for morphologically rich languages and in scenarios where translated words may exhibit minor variations or spelling errors.
The evaluation of each translation is determined by the following:
$$F_{\beta} = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$$
where $P$ represents precision, $R$ denotes recall, and $\beta$ serves as the weighting parameter.
The scores produced by this metric always range between 0 and 1, with higher values indicating superior translation quality.
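For reference, NLTK’s chrF implementation operates directly on strings and returns a score in [0, 1]. A small sketch with hypothetical sentences:

```python
from nltk.translate.chrf_score import sentence_chrf

reference = "la vulnerabilidad permite a atacantes remotos inyectar código arbitrario"
hypothesis = "la vulnerabilidad permite que atacantes remotos inyecten código arbitrario"

# Character n-grams up to length 6, beta = 3 (recall-weighted), the metric's defaults
score = sentence_chrf(reference, hypothesis, min_len=1, max_len=6, beta=3.0)
print(f"chrF: {score:.3f}")
```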
METEOR [46] is a string-based metric developed to address the limitations of traditional metrics such as BLEU, which primarily focus on exact word matches, n-gram matches, or overall sentence length. METEOR incorporates multiple factors to generate a more refined evaluation score, surpassing the effectiveness of simple word-matching techniques. To achieve this, in addition to exact word matches, it considers synonym matches, morphological variations (e.g., “stop” and “stopping”), and word order alignments.
Like previous metrics, METEOR relies on a reference translation as input. Before conducting its calculations, it determines the value of m, which represents the number of words in the hypothesis translation that also appear in the reference translation. The translation score is computed using the following expression:
$$\mathrm{METEOR} = F_{\text{mean}} \cdot (1 - \text{Penalty}) = \frac{P \cdot R}{\alpha \cdot P + (1 - \alpha) \cdot R} \cdot \left( 1 - \gamma \cdot \text{frag\_frac}^{\beta} \right)$$
where $\gamma$ is the weight assigned to the penalty, $\beta$ controls the shape of the penalty function, and
$$P = \frac{m}{|\text{hypothesis}|},$$
where $|\text{hypothesis}|$ is the length of the hypothesis translation,
$$R = \frac{m}{|\text{reference}|},$$
where $|\text{reference}|$ is the length of the reference translation, and
$$\text{frag\_frac} = \frac{c_h}{m},$$
where $c_h$ represents the number of fragments.
The scores generated by METEOR range from 0 to 1, with higher values indicating superior translation quality.
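A sentence-level METEOR score can be obtained from NLTK, which requires pre-tokenized input and WordNet data; note that WordNet-based synonym matching is English-centric, so for Spanish output the synonym component contributes little. A hedged sketch:

```python
import nltk
from nltk.translate.meteor_score import meteor_score

# WordNet is needed for the synonym-matching stage (English-centric)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

reference = ["la", "vulnerabilidad", "permite", "a", "atacantes", "remotos",
             "inyectar", "código", "arbitrario"]
hypothesis = ["la", "vulnerabilidad", "permite", "que", "atacantes", "remotos",
              "inyecten", "código", "arbitrario"]

# References are passed as a list of tokenized sentences
score = meteor_score([reference], hypothesis)
print(f"METEOR: {score:.3f}")  # value in [0, 1]
```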
RIBES (Rank-based Intuitive Bilingual Evaluation Score) [47] is a lexical metric specifically designed for languages with significant syntactic differences. It primarily evaluates the order and correlation of words between the reference and hypothesis translations, employing a rank-based positional correspondence approach to achieve this analysis.
To assess word order, RIBES computes Kendall’s τ coefficient. Additionally, it incorporates a length penalty in the final score to prevent truncated sentences from receiving disproportionately high scores.
The formula for calculating RIBES is given by the following:
$$\mathrm{RIBES} = \mathrm{nkt} \times p_1^{\alpha} \times \mathrm{bp}^{\beta}$$
where
  • nkt: Kendall’s τ coefficient;
  • $p_1$: unigram precision;
  • bp: length penalty;
  • $\alpha$, $\beta$: hyperparameters that control the weight of unigram precision and the length penalty, respectively.
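NLTK also ships a RIBES implementation with the $\alpha$ and $\beta$ hyperparameters exposed. A minimal sketch, again with the hypothetical token lists used above:

```python
from nltk.translate.ribes_score import sentence_ribes

reference = ["la", "vulnerabilidad", "permite", "a", "atacantes", "remotos",
             "inyectar", "código", "arbitrario"]
hypothesis = ["la", "vulnerabilidad", "permite", "que", "atacantes", "remotos",
              "inyecten", "código", "arbitrario"]

# alpha weights unigram precision, beta the length penalty
score = sentence_ribes([reference], hypothesis, alpha=0.25, beta=0.10)
print(f"RIBES: {score:.3f}")
```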
TER (Translation Edit Rate) [48] is a string-based metric designed to quantify the human effort required to transform a machine-generated translation into a perfect human reference. It computes the proportion of minimal edits necessary, including insertions, deletions, substitutions, and reordering operations. Given this approach, a higher TER score signifies a greater effort required to refine the hypothesis translation. The final score is calculated as follows:
$$\mathrm{TER} = \frac{\#\,\text{substitutions} + \#\,\text{insertions} + \#\,\text{deletions} + \#\,\text{shifts}}{\text{length}(\text{reference})}$$
The TER score ranges from 0 to 1, where a lower score (closer to 0) indicates a higher quality translation, as it requires fewer modifications. This aspect is particularly relevant when comparing TER values with those generated by other evaluation metrics.
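TER is available in the sacrebleu package; note that sacrebleu reports scores on a 0-100 scale, so dividing by 100 recovers the 0-1 range used here. A hedged sketch:

```python
from sacrebleu.metrics import TER

reference = "la vulnerabilidad permite a atacantes remotos inyectar código arbitrario"
hypothesis = "la vulnerabilidad permite que atacantes remotos inyecten código arbitrario"

ter = TER()
result = ter.sentence_score(hypothesis, [reference])
print(f"TER: {result.score / 100:.3f}")  # lower is better; fewer edits needed
```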
YiSi (YiSi—Semantic Similarity Evaluation) [49] employs semantic embeddings, such as those derived from pretrained language models, to capture deeper relationships between words beyond surface-level lexical matches. Unlike previous metrics, YiSi is capable of evaluating the semantic meaning of the translations under comparison. This capability makes it particularly effective for assessing translations that require precise comprehension of meaning; however, it also entails higher computational complexity.
YiSi consists of three variants (YiSi-0, YiSi-1, and YiSi-2). In this study, YiSi-1 has been selected due to its demonstrated highest correlation with human judgment and its strong emphasis on semantic comparison.
The final score for the YiSi-1 metric is computed as follows:
$$\text{YiSi-1} = \frac{\text{precision} \cdot \text{recall}}{\beta \cdot \text{precision} + (1 - \beta) \cdot \text{recall}} = \frac{\left( \phi \cdot s_{rl_p} + (1 - \phi) \cdot s_p(e_{\text{sent}}, f_{\text{sent}}) \right) \cdot \left( \phi \cdot s_{rl_r} + (1 - \phi) \cdot s_r(e_{\text{sent}}, f_{\text{sent}}) \right)}{\beta \cdot \left( \phi \cdot s_{rl_p} + (1 - \phi) \cdot s_p(e_{\text{sent}}, f_{\text{sent}}) \right) + (1 - \beta) \cdot \left( \phi \cdot s_{rl_r} + (1 - \phi) \cdot s_r(e_{\text{sent}}, f_{\text{sent}}) \right)}$$
where β is set to 0.7 to make YiSi more recall-oriented when used for machine translation evaluation. The term
$$s_r(e, f) = \cos(v(e), v(f))$$
represents the cosine similarity in the embedding model between the representations of the lexical units $e$ and $f$, where $v(u)$ denotes the vector representation of a lexical unit $u$ (here, $e_{\text{sent}}$ and $f_{\text{sent}}$ are the vector representations of the reference and hypothesis translations, respectively).
The parameter $\phi$ is a weighting factor with a default value of 0.1, primarily used for experimental purposes. The terms $s_{rl_p}$ and $s_{rl_r}$ denote structural semantic precision (the proportion of semantic frames in the hypothesis translation that are present in the reference translation) and structural semantic recall (the proportion of semantic frames in the reference translation correctly aligned with those in the hypothesis translation), respectively.
Semantic frames correspond to sets of related concepts, and their inclusion in these calculations enables a more holistic evaluation, independent of the exact wording used in the hypothesis and reference translations.
The results produced by this metric range from 0 to 1, with 1 representing the highest possible score.
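Since reference implementations of YiSi are less commonly packaged, the following is a deliberately simplified sketch of the YiSi-1 scoring idea with $\phi = 0$, i.e., omitting the semantic-frame terms and idf weighting, and using greedy maximum cosine alignment over word embeddings; the embedding vectors are toy values purely for illustration:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def yisi1_simplified(hyp_vecs, ref_vecs, beta=0.7):
    """Simplified YiSi-1 (phi = 0): embedding-based precision/recall
    combined with the recall-oriented weighted harmonic mean."""
    # Precision: each hypothesis word aligned to its best reference match
    precision = np.mean([max(cosine(h, r) for r in ref_vecs) for h in hyp_vecs])
    # Recall: each reference word aligned to its best hypothesis match
    recall = np.mean([max(cosine(h, r) for h in hyp_vecs) for r in ref_vecs])
    return (precision * recall) / (beta * precision + (1 - beta) * recall)

# Toy 3-dimensional "embeddings" purely for illustration
hyp_vecs = [np.array([1.0, 0.1, 0.0]), np.array([0.2, 1.0, 0.1])]
ref_vecs = [np.array([0.9, 0.2, 0.0]), np.array([0.1, 1.0, 0.2])]
print(round(yisi1_simplified(hyp_vecs, ref_vecs), 3))
```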
COMET (Crosslingual Optimized Metric for Evaluation of Translation) [50] is a deep learning-based evaluation metric that utilizes multilingual models to compare translations against both reference translations and source texts. This approach integrates semantic embeddings and contextual features to provide a more comprehensive and accurate assessment of translation quality.
To generate its final evaluation, COMET employs neural networks trained on large datasets containing human judgments of translation quality. These neural networks are specifically designed to assess translations in a manner analogous to human evaluators. The metric leverages embedding models to represent words and phrases within a vector space, capturing semantic similarities between the machine translation and the reference.
Moreover, COMET evaluates multiple aspects of translation quality, including fluency, adequacy, and meaning preservation. This multidimensional assessment enables a more detailed and precise evaluation compared to traditional metrics, which typically focus on isolated aspects of translations.
To perform the evaluation, COMET first computes a set of combined features based on the embeddings of the source text (s), the hypothesis (h), and the reference (r). These features are as follows:
  • Element-wise product with the source text ($h \cdot s$).
  • Element-wise product with the reference ($h \cdot r$).
  • Element-wise absolute difference with the source text ($|h - s|$).
  • Element-wise absolute difference with the reference ($|h - r|$).
These features are then combined with the embeddings of the reference and hypothesis to form an input vector x:
$$x = [\, h;\ r;\ h \cdot s;\ h \cdot r;\ |h - s|;\ |h - r| \,]$$
This vector is passed through a feed-forward neural network to minimize the mean squared error.
The final quality score of a hypothesis is computed as follows:
$$f(s, \hat{h}, r) = \frac{2 \cdot d(r, \hat{h}) \cdot d(s, \hat{h})}{d(r, \hat{h}) + d(s, \hat{h})}$$
where $d(u, v)$ represents the Euclidean distance between the embeddings of $u$ and $v$ (e.g., hypothesis and reference). Finally, the result is normalized to obtain a value between 0 and 1.
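In practice, COMET scores can be produced with the open-source unbabel-comet package. A hedged sketch using a publicly released checkpoint (the model name is an example and subject to availability):

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# Example public checkpoint; other reference-based COMET models also work
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Cross-site scripting vulnerability allows remote attackers to inject arbitrary web script.",
    "mt":  "La vulnerabilidad de cross-site scripting permite a atacantes remotos inyectar scripts web arbitrarios.",
    "ref": "Una vulnerabilidad XSS permite que atacantes remotos inyecten scripts web arbitrarios.",
}]
output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(output.system_score)  # roughly in [0, 1] for this model family
```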
BERTScore [51] is a metric that measures the semantic similarity between translations using pretrained contextual embeddings. To accomplish this, BERTScore converts both the reference translation ($x = \{x_1, x_2, \ldots, x_k\}$) and the hypothesis translation ($\hat{x} = \{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_l\}$) into vector representations.
Then, BERTScore computes the cosine similarity between each pair of vectors from both translations:
$$\mathrm{sim}(x_i, \hat{x}_j) = \frac{x_i^{\top} \hat{x}_j}{\| x_i \| \, \| \hat{x}_j \|}$$
Finally, BERTScore uses a greedy matching system, where for each word in the reference translation, it finds the most similar word in the hypothesis translation (Recall) and vice versa (Precision). These two values are then combined to produce the final score.
$$R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathrm{sim}(x_i, \hat{x}_j)$$
$$P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathrm{sim}(x_i, \hat{x}_j)$$
$$F_{\mathrm{BERT}} = \frac{2 \cdot P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}$$
The $F_{\mathrm{BERT}}$ score ranges between 0 and 1, with higher values indicating better performance.
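Although BERTScore was ultimately excluded from the experiment (see Section 3.6), it is straightforward to compute with the bert-score package. A minimal sketch with hypothetical sentences:

```python
# pip install bert-score
from bert_score import score

candidates = ["La vulnerabilidad permite que atacantes remotos inyecten código arbitrario."]
references = ["La vulnerabilidad permite a atacantes remotos inyectar código arbitrario."]

# lang="es" selects a default multilingual model appropriate for Spanish
P, R, F1 = score(candidates, references, lang="es")
print(f"F_BERT: {F1.mean().item():.3f}")
```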
All the methods described require a reference translation. However, evaluating translation quality without a reference translation is crucial, as it enables the assessment of coherence and semantic accuracy without relying on a pre-existing ideal translation. The primary distinction in interpreting the results lies in the focus of reference-free metrics on the semantic quality of the translation, whereas reference-based metrics provide a more comprehensive analysis by considering not only lexical similarities but also factors such as fluency, coherence, and context. Nevertheless, if the reference translation is inadequate, this type of evaluation may either underestimate the quality of the generated translation or, conversely, overestimate its quality if the reference is imperfect but aligns well with the metric’s expectations. For this reason, word embeddings have been employed to estimate translation quality without requiring a reference translation.
A word embedding is a method of representing words as numerical vectors. Instead of treating each word as an independent symbol (as in a traditional dictionary), word embeddings map words into a continuous vector space. These embeddings capture lexical and semantic information through models trained on large text corpora [52]. Such representations enable the comparison of words within a vector space, where the proximity between vectors reflects semantic similarity. Consequently, they facilitate the direct evaluation of translation quality based on a corpus or dictionary rather than relying on a predefined reference translation. When incorporated into a metric, as in the case of YiSi, word embeddings determine whether the words in the translation occupy the same “position” as those in an ideal human translation.
The application of word embeddings involves transforming words into numerical vectors within a fixed-dimensional space. This transformation is achieved using models such as Word2Vec [53], which employs neural network architectures to predict words based on surrounding context (CBOW model) or to predict context from a target word (Skip-Gram model); GloVe [54], which generates embeddings through matrix factorization of word co-occurrence data in a corpus; and FastText [55], which extends Word2Vec by incorporating subword information using n-grams, allowing the creation of representations for out-of-vocabulary words. The calculation of semantic similarity between words or phrases is performed using metrics such as cosine similarity.
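As a small illustration of how such embeddings are obtained and queried, gensim can train a Skip-Gram Word2Vec model on a toy corpus and return cosine similarities directly. This sketch uses invented example sentences:

```python
from gensim.models import Word2Vec

# Tiny hypothetical corpus of tokenized vulnerability descriptions
corpus = [
    ["buffer", "overflow", "allows", "remote", "attackers", "to", "execute", "code"],
    ["heap", "overflow", "allows", "attackers", "to", "execute", "arbitrary", "code"],
    ["xss", "vulnerability", "allows", "remote", "attackers", "to", "inject", "script"],
]

# sg=1 selects the Skip-Gram architecture (sg=0 would be CBOW)
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# Cosine similarity between two in-vocabulary words
print(model.wv.similarity("overflow", "code"))
```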
To compare words across different languages, it is necessary to align the vector spaces of the word embeddings. This process is performed using an orthogonal transformation computed with the Orthogonal Procrustes technique [56]. Given two sets of vectors, X and Y (from the source and target languages, respectively), the transformation matrix W is obtained by minimizing the squared distance between the sets:
$$\min_{W} \| W X - Y \|_F \quad \text{such that} \quad W^{\top} W = I,$$
where $\| \cdot \|_F$ denotes the Frobenius norm, and $W$ is an orthogonal matrix.
The resulting metric from this process ranges from 0 to 1, where 0 indicates the worst possible alignment, and 1 represents perfect alignment of the vector spaces between both languages.
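A minimal numerical sketch of this alignment step uses scipy.linalg.orthogonal_procrustes, which solves the same minimization with anchor vectors stored as rows ($XW \approx Y$ rather than $WX \approx Y$). The synthetic target space below is an exact rotation of the source, so the recovered mean cosine similarity should be close to 1:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 300))  # 50 source-language anchor vectors (rows)
Q, _ = np.linalg.qr(rng.normal(size=(300, 300)))
Y = X @ Q                       # toy target space: an exact rotation of X

# Solves min ||X W - Y||_F subject to W^T W = I
W, _ = orthogonal_procrustes(X, Y)
aligned = X @ W

# Mean cosine similarity over anchor pairs: 1.0 means perfect alignment
sims = np.sum(aligned * Y, axis=1) / (
    np.linalg.norm(aligned, axis=1) * np.linalg.norm(Y, axis=1))
print(round(float(np.mean(sims)), 4))
```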
Table 3 provides a summary of the studied evaluation metrics.

3. Materials and Methods

This section details the experimental design conducted to evaluate the performance of different selected tools in translating vulnerability descriptions.

3.1. Research Questions

The research questions posed in this study are as follows:
  • RQ1: Which translator provides the highest quality translations?
  • RQ2: Is the reference translation reliable?
  • RQ3: Which translator produces vulnerability description translations that most closely align with those reported by INCIBE?
  • RQ4: Are there differences between the performance of the translators depending on the size of the translated description?
  • RQ5: Are the translators efficient in terms of time and cost?
  • RQ6: Are GenAIs or specialized machine translators more efficient?

3.2. Datasets

For this study, a database provided by INCIBE in JSON format has been used, which is available at [8]. The main characteristics of this database are summarized in Table 4.
From the above dataset, we have extracted a sample for experimentation. The main characteristics of the sample are included in Table 5. The reason for extracting a sample from the dataset is due to time and cost considerations in the experimentation process.
Also, we can divide the sample into short and long descriptions of vulnerabilities. Taking the median of 236 characters as the threshold, there are 3772 vulnerabilities that we call short (character count in $(0, 236]$) and 3742 descriptions categorized as long (character count $\geq 237$). We show two examples below.
The description of the vulnerability identified by the code CVE-2006-0533, containing 160 characters, serves as an example of a short description: Cross-site scripting (XSS) vulnerability in webmailaging.cgi in cPanel allows remote attackers to inject arbitrary web script or HTML via the numdays parameter. On the other hand, the description of the vulnerability identified by the code CVE-2006-1056, containing 606 characters, serves as an example of a long description: The Linux kernel before 2.6.16.9 and the FreeBSD kernel, when running on AMD64 and other 7th and 8th generation AuthenticAMD processors, only save/restore the FOP, FIP, and FDP x87 registers in FXSAVE/FXRSTOR when an exception is pending, which allows one process to determine portions of the state of floating point instructions of other processes, which can be leveraged to obtain sensitive information such as cryptographic keys. NOTE: this is the documented behavior of AMD64 processors, but it is inconsistent with Intel processors in a security-relevant fashion that was not addressed by the kernels.

3.3. Machine Translators and GenAI

For this experiment, it is essential to ensure that all conditions remain consistent, even if they could be optimized in a real-world application or production environment. To achieve this, the study has deliberately relied on the most basic implementations, avoiding associated costs and utilizing only fundamental resources. This approach ensures an objective and fair comparison of GenAI tools, eliminating external factors that could introduce bias into the results. Furthermore, by maintaining uniform testing conditions, the findings provide a reliable baseline for evaluating the inherent capabilities of each tool, independent of any additional optimizations that might be applied in a practical setting.
The tools evaluated in the study were ChatGPT, Bing Translator, DeepL, Google Translate, and Copilot. In Table 6, some of the most representative features of each GenAI are presented.
The free versions of all the tools mentioned above were used.
Although the inclusion of the Gemini tool was initially considered, it was ultimately excluded from the final tests due to the requirement of creating an account and its significant limitations.
For the purpose of comparing the efficiency of machine translators versus GenAI tools, we have created two distinct groups. The Translators Group consists of Google Translate, Bing Translator, and DeepL, all of which are specialized tools for machine translation. On the other hand, the GenAIs Group includes ChatGPT and Copilot, both of which leverage GenAI technologies that extend beyond just translation tasks, allowing them to adapt to a variety of linguistic needs.

3.4. Data Collection Methodology

During the translation generation process, several limitations were encountered, as summarized in Table 7, which influenced the design of the data collection methodology for the experiment.
For most of the translation tools used in the experiment, automated scripts were developed to emulate manual interaction with their interfaces. These scripts were designed to simulate mouse movements and utilize copy-paste commands, ensuring that they minimize detection by security systems while optimizing the retrieval of translations. The use of these scripts aimed to emulate a real-world scenario where users would interact with the tools in a similar manner. Below, we detail the implementation for each tool:
  • ChatGPT: The GPT-4o-mini model was used due to the limitations of the free version of GPT-4. The script incorporated strategies to handle temporary message blocks, conversation length limits, and auto-scroll issues. Additionally, automatic pauses and session restarts were implemented to ensure process continuity. To optimize performance, a second device was employed, and the processing order of the original file was modified.
  • Bing Translate and Google Translate: Scripts were developed to simulate mouse movements, avoiding direct web scraping.
  • DeepL: Translation automation was achieved through copy-pasting into the DeepL interface. To prevent service blocks, periodic tab resets were implemented every 10 translations.
  • Copilot: A script was developed to automate the insertion of translations into Copilot. The script simulated mouse movements and used copy-paste commands to manage and transfer translations automatically into the system, incorporating strategies similar to those used for ChatGPT to handle temporary message blocks. However, for this tool, due to the inability to detect when a prompt had been answered, fixed waiting times were established for each interaction.
During tests without a database, the prompts included in Table 8 were used to manage the translation of vulnerabilities.

3.5. Response Variables

The response variables analyzed in the experiment are as follows:
  • The vulnerability description translated into Spanish by each machine translator and GenAI, and the reference translation provided by INCIBE.
  • Evaluation of the translation using machine translation evaluation metrics.
  • Time: measurement of the time taken to complete the translations. In this context, the time measurement includes both the system’s active periods and the fixed waiting times, which are an integral part of the interaction process with the tool.

3.6. Evaluation Metrics

In order to evaluate the translations obtained without the necessity of a reference translation, several pairs of word embedding models (one English model and one Spanish model per pair) were built using the FastText library [55] with various available corpora, so that each pair captures word relationships for the English-Spanish language pair. The first model pair was trained on all translations from the original dataset provided by INCIBE [8], which included vulnerabilities and their corresponding translations. The second pair was trained on a publicly available translation corpus, also provided by INCIBE [57]. Finally, the third model pair consisted of the pretrained models wiki.en.bin and wiki.es.bin, which are available within the FastText ecosystem and were developed as part of the MUSE research project [58] by Meta. A minimal sketch of the training step for the first two model pairs is shown below.
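The following is a minimal sketch of how one such model pair could be trained; the corpus file names, output names, and dimensionality are hypothetical placeholders.

```python
# A minimal sketch of training one English-Spanish FastText model pair.
# The corpus file names, output names, and dimensionality are hypothetical.
import fasttext

# One vulnerability description per line, in each language.
en_model = fasttext.train_unsupervised("descriptions_en.txt", model="skipgram", dim=300)
es_model = fasttext.train_unsupervised("descriptions_es.txt", model="skipgram", dim=300)

en_model.save_model("vuln_en.bin")
es_model.save_model("vuln_es.bin")
```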
For the anchor word dictionaries of the first and second model pairs, a script was developed to identify repeated words within each pair of descriptions (English-Spanish) associated with the same vulnerability. The script analyzes and compares two text files in different languages, identifying common words and generating structured results. Both files are processed simultaneously, with each line split into individual words and repetitions within the same line removed. A comprehensive word count is then performed for each file, recording the frequency of occurrences and the lines in which each word appears. Words appearing in the corresponding lines of both files are identified as common words and stored alongside their frequency of occurrence and the lines where they match. After all lines are processed, words that occur with identical frequency in both files are filtered, consolidating the relevant data. For the third model pair, the anchor word dictionary provided by the same sources as the models was used.
After obtaining the three model pairs and their respective anchor word dictionaries, the evaluation script was executed for each of them. This script first extracts vector representations for a set of shared key words, referred to as anchors, in both languages. These anchors are used to compute an orthogonal transformation matrix that aligns the source language’s vector space with that of the target language, employing the Orthogonal Procrustes technique [56]. For each sentence pair, the semantic similarity between the source text and its possible translations is computed by obtaining their respective vector representations as the average of the word vectors that compose them. The vector spaces are then aligned, and similarity is measured using cosine similarity, leveraging tools from the SciPy library [59]. A minimal sketch of this alignment and scoring step follows.
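To make this procedure concrete, the following sketch is written under stated assumptions: the pretrained wiki.en.bin and wiki.es.bin models are available locally, and the anchor pairs shown are hypothetical examples of dictionary entries rather than the full dictionaries used in the experiment.

```python
# A minimal sketch of the cross-lingual alignment and scoring step.
import numpy as np
import fasttext
from scipy.linalg import orthogonal_procrustes
from scipy.spatial.distance import cosine

en_model = fasttext.load_model("wiki.en.bin")   # source-language vectors
es_model = fasttext.load_model("wiki.es.bin")   # target-language vectors

# Hypothetical anchor pairs; in the experiment these come from the
# repeated-word dictionaries or the MUSE dictionary.
anchors = [("vulnerability", "vulnerabilidad"), ("attacker", "atacante"),
           ("remote", "remoto"), ("code", "código")]

# Stack anchor vectors and solve for the orthogonal matrix R that best
# maps the English space onto the Spanish space (Procrustes).
X = np.vstack([en_model.get_word_vector(en) for en, _ in anchors])
Y = np.vstack([es_model.get_word_vector(es) for _, es in anchors])
R, _ = orthogonal_procrustes(X, Y)

def sentence_vector(model, text):
    """Average of the word vectors composing the sentence."""
    return np.mean([model.get_word_vector(w) for w in text.split()], axis=0)

def translation_similarity(src_sentence, tgt_sentence):
    """Cosine similarity between the aligned source vector and the target."""
    src = sentence_vector(en_model, src_sentence) @ R  # map into Spanish space
    tgt = sentence_vector(es_model, tgt_sentence)
    return 1.0 - cosine(src, tgt)  # scipy's cosine() returns a distance

print(translation_similarity(
    "A remote attacker can execute arbitrary code.",
    "Un atacante remoto puede ejecutar código arbitrario."))
```

In practice the anchor dictionaries contain many entries; the orthogonal constraint preserves distances within each vector space while rotating the source space onto the target space.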
The selection of evaluation metrics with a reference translation (specifically, the translation provided by INCIBE) was guided by the objective of assessing a diverse range of aspects, covering metrics of different types (lexical, semantic, word order, and edit distance). The selection was also influenced by resource constraints and the requests of the expert team, prioritizing metrics deemed most applicable for practical use. BERTScore and Trados Studio were initially considered but were excluded from the evaluation due to licensing costs, long execution times, and request limitations. The final selection of metrics is included in Table 9; a sketch of how the NLTK-based metrics can be computed follows.
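As an illustration of the Table 9 implementations, the following minimal sketch scores a single hypothetical sentence pair with the NLTK-based metrics; YiSi, RIBES, TER, and COMET are omitted here because they rely on the external tools listed in the table.

```python
# A minimal sketch of the reference-based evaluation, assuming the NLTK
# implementations listed in Table 9; the sentence pair is a hypothetical
# example. METEOR requires the WordNet data (nltk.download("wordnet")).
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.gleu_score import sentence_gleu
from nltk.translate.chrf_score import sentence_chrf
from nltk.translate.meteor_score import meteor_score

reference = "un atacante remoto puede ejecutar código arbitrario".split()
hypothesis = "un atacante remoto podría ejecutar código arbitrario".split()

print("BLEU:  ", sentence_bleu([reference], hypothesis))
print("GLEU:  ", sentence_gleu([reference], hypothesis))
print("CHRF:  ", sentence_chrf(" ".join(reference), " ".join(hypothesis)))
print("METEOR:", meteor_score([reference], hypothesis))
```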

3.7. Analysis

To assess the quality of the translations generated by each GenAI and machine translator, a comprehensive evaluation was conducted using a combination of various indicators.
For measuring the quality of each translation, including those that later serve as the reference, descriptive statistics were computed from the word embedding similarities. The translator with the highest average similarity (cosine similarity closest to 1) produces translations most faithful to the original meaning, since its outputs are, on average, most aligned with the source text’s semantic content. Dispersion matters as well: a lower standard deviation signifies greater consistency in translation quality, whereas a translator with a high mean but a high standard deviation produces some high-quality translations alongside others that diverge significantly from the intended meaning, making it less reliable. In the presence of extreme values, such as when the median is lower than the mean, the median is employed as a more robust measure to avoid the influence of outliers. Finally, the maximum and minimum values provide further insight into system performance; a low minimum score may indicate that some translations are of suboptimal quality.
Given the three distinct word embedding models used (the INCIBE dataset, a public corpus, and pretrained models), it is critical to assess whether the results remain consistent across all models. If a translator performs well in all three models, it provides strong evidence that its translations are more accurate. Conversely, if a translator excels in one model but performs poorly in another, it may indicate that its translation style aligns more closely with specific types of texts (e.g., aligning better with INCIBE but less so with pretrained models). Therefore, the individual performance of each translator was evaluated based on the number of cases (vulnerabilities) in which it achieved the best result compared to other models. This metric, referred to as Victories, indicates the reliability of the translator. A translator with a high number of victories can be considered more reliable, and a consistently superior performance in the majority of cases strongly suggests better translation quality.
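By way of illustration, the following minimal sketch computes these descriptive statistics and the Victories count from per-vulnerability similarity scores; the scores and the subset of systems shown are hypothetical.

```python
# A minimal sketch of the descriptive analysis and the Victories count,
# assuming a hypothetical table of per-vulnerability similarity scores.
import pandas as pd

scores = pd.DataFrame({
    "INCIBE":  [0.91, 0.88, 0.93, 0.86],
    "ChatGPT": [0.90, 0.84, 0.95, 0.79],
    "DeepL":   [0.92, 0.89, 0.91, 0.88],
})

# Mean, dispersion, and extremes per system.
summary = scores.agg(["mean", "std", "median", "max", "min"]).T

# Victories: for each vulnerability (row), which system scored highest.
summary["victories"] = (scores.idxmax(axis=1).value_counts()
                        .reindex(scores.columns, fill_value=0))
print(summary)
```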
After confirming the quality of the reference translation provided by INCIBE and establishing that no tool can be excluded (as all translators achieved sufficiently high translation quality), we propose the following assessment framework: The evaluation system is designed to highlight translations that exhibit statistically significant differences and do not receive similar evaluations across different metrics, thereby indicating notable performance variations among the translation tools. This aspect is referred to as Consistency. To evaluate consistency, potential correlations and significant differences between translation evaluations were analyzed. The correlation analysis provides insights into whether translations are generally aligned in terms of overall quality. A high correlation between the two evaluations suggests that the translations are similar in global quality, implying they may share strengths or weaknesses in similar areas (e.g., fidelity to the original meaning or fluency). Conversely, a low correlation indicates that the translations differ considerably in terms of the assessed quality aspects. This could suggest that one translation excels in some areas while another may perform better in different contexts. The analysis of statistically significant differences in evaluation scores allows for the identification of objectively superior or inferior translations in terms of the evaluated aspects (such as fidelity, fluency, accuracy, and semantics). Failure to identify statistically significant differences implies that the metrics are unable to conclusively distinguish between translations.
In conclusion, the analysis reveals the following patterns in translation quality assessment: First, when a high correlation is observed alongside the absence of significant differences, the translations are likely to exhibit similar quality, reflecting comparable translation strategies. Second, when a low correlation and significant differences are present, the translations tend to display substantially different quality levels. Third, when a high correlation is coupled with significant differences, the translations demonstrate similar quality variations across the evaluated aspects, with one translation consistently outperforming the others. Lastly, when a low correlation is observed without significant differences, it suggests that the metric may not be effectively capturing the quality disparities between the translations.
To quantify consistency, the evaluation scores of each translation, using INCIBE’s translation as the reference, were compared using statistical inference methods. A normality test was conducted first; if it indicated normality, paired samples were compared using Student’s t-test, and otherwise Wilcoxon’s test was employed. To assess the correlation between translation evaluation scores, Pearson’s correlation coefficient was calculated. A significance level of α = 0.05 was chosen (a sketch of this procedure is provided after the scoring rules below). Following the statistical analysis, a consistency score for each translator was computed as follows:
  • Translations that exhibit statistically significant differences in any of the comparisons are awarded one point, signifying that these translations demonstrate distinct performance according to the evaluated metrics. This suggests that one translation is significantly superior or inferior to another.
  • Translations that do not show significant differences are not awarded points, as there is insufficient evidence to suggest that their quality differs significantly based on the metrics.
  • Translations with significant and high correlations in the range [0.5, 1] do not receive any additional points, as such correlations imply that the translations do not differ substantially in terms of quality. The interval [−1, −0.5] is excluded, as only TER operates inversely, and its final evaluation has been adjusted accordingly.
  • Translations exhibiting correlations in the range (−0.5, 0.5) or non-significant correlations (p ≥ 0.05) are assigned one point. These points are not awarded more than once.
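The sketch below illustrates the testing and scoring procedure for one pair of translators; it assumes the Shapiro-Wilk test as the normality check (applied to the paired differences), and the metric scores shown are hypothetical.

```python
# A minimal sketch of the consistency testing step for one translator
# pair; data are hypothetical per-vulnerability metric scores. TER is
# assumed to be inverted beforehand so all metrics increase with quality.
import numpy as np
from scipy import stats

ALPHA = 0.05
scores_a = np.array([0.71, 0.68, 0.74, 0.69, 0.72])  # e.g., BLEU per description
scores_b = np.array([0.65, 0.70, 0.66, 0.64, 0.69])

# Normality of the paired differences decides which paired test to use.
_, p_norm = stats.shapiro(scores_a - scores_b)
if p_norm > ALPHA:
    _, p_diff = stats.ttest_rel(scores_a, scores_b)   # paired Student's t-test
else:
    _, p_diff = stats.wilcoxon(scores_a, scores_b)    # Wilcoxon signed-rank test

r, p_corr = stats.pearsonr(scores_a, scores_b)

# Scoring rules from the text: one point for a significant difference,
# one point for a weak or non-significant correlation.
points = 0
if p_diff < ALPHA:
    points += 1
if p_corr >= ALPHA or abs(r) < 0.5:
    points += 1
print(f"p_diff={p_diff:.3f}, r={r:.2f}, consistency points={points}")
```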
This scoring system is based on the performance differences between translations, but these differences do not directly indicate that one translation is inherently better than another; rather, they highlight variations in translation behavior according to the evaluated metrics. Translations with more significant differences may, in some cases, indicate poorer quality. To complement this evaluation, a ranking is provided that assesses adherence to the translation style presented by INCIBE. This ranking is calculated from the average scores obtained across all metrics, using INCIBE’s translations as the reference. Using these average scores, we also examined differences in translator performance between short and long vulnerability descriptions.

3.8. Technical Details

The implementation was conducted in a controlled environment to ensure consistency and reproducibility. The computational setup consisted of the following hardware and software specifications:
Hardware Configuration:
  • Number of devices: 5 PCs;
  • CPU: Intel Core i5-8400 (Intel, Santa Clara, CA, USA) (6 cores, 6 threads, base clock 2.8 GHz, turbo up to 4.0 GHz);
  • Memory: 8 GB RAM per device.
Software Environment:
  • Operating Systems: Windows and Linux;
  • Programming Language: Python 3.
All computations and network operations were executed from a single IP address.

4. Results and Discussion

4.1. RQ1 and RQ2: Quality of the Translations

The descriptive results of the models trained using the word embeddings are presented in Figure 1 and in Table 10.
As observed, all translators provide translations that meet a sufficiently high quality threshold (≥0.8), with some even surpassing the score achieved by the reference translation from INCIBE. Therefore, it can be concluded that no translator should be dismissed a priori. Additionally, this finding affirms that the reference translation itself demonstrates a sufficiently high level of quality. However, a recurring trend emerges, indicating that the pretrained models consistently yield superior results, whereas those utilizing the corpus tend to perform less favorably. This prompts a reconsideration of the corpus quality used in the translation process.
Figure 2 illustrates the total number of victories across the three word embedding models for each translator and the INCIBE translations. As shown, Bing achieves a significantly lower number of victories compared to the other systems. Conversely, the translations from INCIBE, ChatGPT, and Copilot attain a similar and relatively higher number of victories, suggesting that these systems exhibit greater stability in translation quality.

4.2. RQ3: Accuracy of the Translation Style

Once the translations have been assessed individually and the reliability of the reference translation has been confirmed, the next objective is to identify which translator produces translations that most closely align with the desired style.
The p-values obtained from the pairwise comparisons for paired samples are presented in Table 11.
The p-values from the Pearson correlation test that uses the INCIBE translations as a reference are p < 0.001 in all cases.
The Pearson correlation coefficients obtained can be found in Table 12.
Counting the instances in which a translator’s output demonstrates statistically significant differences or exhibits low or no correlation yields the data presented in Table 13.
According to the difference count, DeepL and Copilot stand out compared to the others. To determine whether this is because they are more similar to INCIBE’s translations or because they deviate more, their results were cross-checked against a ranking obtained by summarizing the evaluation metric results for each translation. Based on the average score of each translator in Table 14, DeepL and Google Translate achieve the best overall performance across all metrics, closely followed by ChatGPT.
Upon a more detailed analysis of the alignment with INCIBE, it becomes evident from the ranking presented in Table 15 that translations produced by DeepL and Google Translate exhibit a closer alignment with the reference text, particularly in terms of structure and vocabulary. These tools also achieve the highest scores in both COMET and YiSi, suggesting that their translations are more fluent and semantically accurate. In terms of word order, all translators demonstrate relatively high values; however, ChatGPT and DeepL stand out as performing slightly better.
Overall, DeepL and Google Translate emerge as the most effective tools, offering a balanced combination of lexical accuracy, fluency, and semantic fidelity. ChatGPT also delivers strong results, though it may exhibit minor deficiencies in word order and structural metrics. Bing, while yielding acceptable results, ranks lower than the top three translators. Copilot receives the lowest ratings across all evaluation metrics, indicating a lower degree of precision and fluency in its translations.

4.3. RQ4: Effect of the Size of the Description in the Performance of the Translation

Table 16 reports the average metric values for short and long descriptions.
The results presented in Table 16 reveal a consistent performance gap between short and long descriptions across all evaluated translation systems and metrics. In general, short descriptions achieve higher scores in all metrics, including BLEU, COMET, and METEOR, among others. This suggests that machine translation systems are more effective when handling concise inputs, likely due to reduced syntactic complexity and fewer dependencies that need to be preserved across languages. Conversely, long descriptions, which typically involve more technical detail, embedded clauses, and domain-specific terminology, pose greater challenges for accurate translation. As a result, the observed drop in performance for longer texts may have implications for the practical application of these tools in cybersecurity contexts, where timely and accurate translation of detailed vulnerability reports is crucial. These findings underscore the importance of refining translation strategies based on input length and complexity and open new research opportunities in adaptive translation workflows, hybrid human-AI revision models, and the development of specialized models for long-form technical content.

4.4. RQ5: Time Complexity

The average extraction time in seconds for the translations from each translator system is presented in Table 17.

4.5. RQ6: GenAI or Machine Translators?

In this analysis, the results are categorized based on their origin, either from a machine translator or a generative AI system (GenAI). Table 18 presents the outcomes of the word embedding test, which are organized according to this classification, alongside the results from INCIBE’s translations.
The META model emerges as the most rigorous in terms of semantic similarity. In contrast, the Corpus (INCIBE) exhibits greater variability in its evaluations, suggesting a heightened sensitivity to variations in the input data. According to the DB (INCIBE) model, GenAI achieves the highest average quality in terms of semantic similarity; however, its variability is notably broader compared to the INCIBE translations. GenAI continues to outperform other systems in the Corpus (INCIBE) model, yet INCIBE translations demonstrate greater consistency than those produced by machine translators. In the META model, INCIBE achieves both the highest mean score and the lowest variability, while GenAI struggles with consistency despite securing a greater number of absolute victories. It can be inferred that the META model favors the INCIBE reference translation, likely due to the closer alignment of its training or reference data with those used in the INCIBE corpus. If DB (INCIBE) serves as the reference database containing all translations, it is logical that the scores would be higher in this model as its evaluations would be more closely aligned with the original reference. Nevertheless, the META model overall produces the highest mean scores. If META employs a broader and more generalized vector space, it may be more robust and capable of identifying semantic similarities with greater precision, thus generating higher scores across the board. Conversely, if DB (INCIBE) utilizes a more restricted vector space or is biased toward a specific type of translation, it may struggle to capture variations introduced by GenAI and machine translators, potentially leading to lower scores for translations that alter word order and structure while preserving semantic meaning.
Another notable observation is that GenAI exhibits the highest variability within the META model, indicating greater fluctuations in its scores. META appears to be more sensitive to substantial differences in translations generated by GenAI, while both DB (INCIBE) and Corpus (INCIBE) impose stricter penalties for structural changes, with META being more accommodating in this regard. This suggests that GenAI translations occasionally align closely with the original semantic meaning, yet at other times, they deviate significantly, which accounts for the broader variability range observed.
Regarding GenAI’s inferior performance in the META and style adjustment tests but superior results in the other two models, it can be inferred that the META model is more stringent in assessing semantic similarity, thus imposing harsher penalties on GenAI translations, particularly when deviations in structure or word order occur. In contrast, the other models (DB and Corpus) may be less sensitive to such variations, allowing GenAI translations to perform better in terms of overall semantic similarity. Additionally, when compared to INCIBE, GenAI’s higher performance in the other two models likely reflects its ability to align more closely with the general semantic meaning, even if it occasionally struggles with consistency in structure—an aspect that is crucial for the evaluation within the META model.
Furthermore, the results of the style adjustment test to align with INCIBE, employing the new grouping, are provided in Table 19. The corresponding ranking can be found in Table 20.
The results are unequivocal. GenAI systems generate translations that, across all metrics, exhibit significantly lower alignment with the INCIBE style in comparison to the translations produced by machine translators.
On average, machine translators outperform GenAI across all metrics, demonstrating a higher degree of similarity to the reference translation. In the similarity ranking with INCIBE, machine translators consistently occupy the top position across all metrics, while GenAI ranks second in each case. The results from BLEU, GLEU, CHRF, and METEOR confirm that machine translators exhibit greater fidelity to the structure, order, and vocabulary of the reference translation. In terms of more advanced metrics, such as COMET and YiSi, machine translators continue to surpass GenAI, indicating not only a closer structural alignment but also a more accurate preservation of the text’s overall meaning.
Although it is commonly believed that GenAI produces more natural and semantically enriched translations, the results of this study suggest that machine translators more accurately convey the reference meaning, as measured by these metrics. This discrepancy may be attributed to traditional machine translators (such as DeepL, Google Translate, and Bing) being specifically trained on aligned translation corpora optimized for reference text matching. In contrast, GenAI models (such as ChatGPT and Copilot) are not exclusively trained for translation tasks but for text generation, which may lead to more creative outputs that deviate from word-for-word alignment with the reference.
Given that this evaluation emphasizes fidelity to the reference, GenAI’s inherent flexibility may result in penalization. Even though GenAI-generated translations may be equally valid, if they use alternative wording, their scores are reduced, regardless of the overall translation quality.

4.6. Discussion and Interpretation

The analysis revealed that DeepL and ChatGPT were the most consistent in terms of fidelity to the reference translation provided by INCIBE. DeepL, in particular, emerged as the most effective in preserving INCIBE’s style, achieving the highest number of “wins” compared to the other translators, indicating a more robust overall performance. In contrast, ChatGPT represents an efficient, cost-free alternative, demonstrating competitive consistency. Although Copilot performed well in certain tests, it displayed inconsistencies, suggesting the need for adjustments to improve its reliability. Bing Translator yielded the lowest results in terms of adherence to the reference translation.
Moreover, variability among translators, particularly in how they handle context, significantly influences the results. This variability could explain why Copilot, while more efficient for free-form translation tasks, may require specific adjustments or training to align more closely with predefined reference translations.
In conclusion, machine translators outperform generative AI systems across all evaluation metrics when the objective is to produce a translation that closely matches the INCIBE reference. While GenAI tools excel at generating more natural and contextually adaptive translations, their inherent flexibility limits their effectiveness in this particular task.
Furthermore, as anticipated, the quality of the corpus significantly impacts the evaluation of translations using word embeddings. Across all models, GenAI exhibits greater variability, suggesting that while it can generate translations, it also produces less consistent results. Machine translators, on the other hand, occupy a middle ground, demonstrating more homogeneous performance without achieving significant superiority in any particular model.
One limitation is the size of the dataset, which may affect the results, since the performance of the machine translation systems either degrades or improves depending on the scenario. Expanding the dataset, free of computational and time constraints, would be necessary to draw more generalizable conclusions. Nevertheless, we believe that the results presented in this study can serve as a solid starting point for designing a more comprehensive experiment in terms of resources, translation systems, and types of descriptions analyzed, taking into account the findings reported here. Moreover, although human judgment is undoubtedly crucial for evaluating translation quality, the very absence of a human agent is precisely what this work seeks to assess. In this context, relying on a manual process not only for generating translations but also for reviewing them would be counterproductive in terms of time and cost, especially given the large and continuous volume of translations required. In any case, it is important to note that the dataset includes the version translated by INCIBE, which has already been revised by professional human translators. By measuring how closely the generated translations align with INCIBE’s official version, we aim to incorporate the human factor into the evaluation process.
Another potential limitation concerns the scope of automation. Automation in this domain is feasible because the early vulnerability alert service simply provides a Spanish replica of the vulnerabilities published by NIST; it is a translation service, not an incident response service. The high sensitivity of the alerts is defined by other standards, which follow their own procedures and standardized taxonomies. As previously explained, the goal is not to fully automate the entire process but rather to streamline the translation phase while ensuring a minimum guaranteed level of quality. For this reason, the quality of the translations would indeed need to be assessed periodically to ensure it remains at an adequate level, although it is not necessary to evaluate translation performance in real time. In cases where the translations do not meet an acceptable standard, or when a large volume of new vulnerabilities requires highly technical language, the most appropriate solution would be to explore retraining the automatic translation systems; in such a case, the best approach would be to use a GenAI model that can be enriched with relevant databases.

5. Conclusions

The early warning of vulnerabilities plays a critical role in protecting businesses, institutions, and organizations by enabling proactive risk mitigation and response strategies. The precise translation of these alerts into Spanish enhances accessibility, ensuring that key stakeholders can swiftly comprehend and respond to critical security threats. This process not only strengthens cybersecurity awareness but also facilitates the effective dissemination of knowledge within the broader community, fostering a more resilient and informed ecosystem against emerging threats. The integration of machine translation tools can further enhance the efficiency of automatically translating vulnerability alerts.
The paper presents a contribution to the field of cybersecurity and linguistics by providing a comprehensive evaluation of GenAI and machine translation tools in the context of translating cybersecurity vulnerability alerts. This study evaluates the performance of various machine translation systems, including generative AI tools, in translating vulnerability descriptions.
The results show that, overall, dedicated translation tools such as DeepL and Google Translate tend to offer slightly higher structural accuracy and style consistency. However, generative AI tools like ChatGPT and Copilot demonstrate remarkable adaptability and produce fluent, natural-sounding translations, especially for shorter, less technical descriptions. The methodology adopted in this study—based on the use of diverse evaluation metrics and embedding-based semantic similarity models—has enabled a nuanced and multidimensional analysis of translation quality. It also highlighted significant differences in how tools handle short versus long descriptions, revealing that longer, more complex entries tend to degrade translation performance more noticeably across all systems. The main advantage of using generative AI in this context lies in its flexibility, scalability, and the ability to integrate it into broader AI-powered workflows. Despite not being specifically designed for translation, GenAI tools offer competitive performance while facilitating automation and reducing human workload—particularly useful in time-sensitive cybersecurity environments. Furthermore, their prompt-based nature makes them highly customizable, opening the door to domain-specific fine-tuning. Compared to other tools currently on the market, the proposed approach demonstrates strong feasibility: most of the tested tools are publicly available, require minimal technical setup, and support automation (with limitations). While licensing constraints and lack of API access pose challenges for full automation in real-world secure environments such as INCIBE, the results confirm that a hybrid model—leveraging GenAI for initial translation and traditional tools or human revision for validation—may offer an optimal balance between efficiency and quality.
In summary, this work not only provides a practical benchmark for selecting machine translation tools in the cybersecurity field but also contributes to emerging research lines at the intersection of natural language processing, multilingual threat intelligence, and human-AI collaboration. Future research may focus on prompt engineering strategies, fine-tuning GenAI models for security-specific content, or exploring real-time implementation in secure operational settings.
Future research should focus on expanding the dataset, optimizing prompts, and enhancing translations through data enrichment. We propose the hypothesis that cybersecurity-trained translation models could achieve greater precision and coherence. Comparing general-purpose tools with specialized databases may significantly improve semantic fidelity, making domain-specific generative AI systems more effective for translating vulnerability alerts. Also, a broader study encompassing various types of cybersecurity texts could indeed offer a more comprehensive view of the role of GenAI in this domain. Finally, another line of future research involves newer or updated generative models to compare against current tools.

Author Contributions

Conceptualization, N.D.-G.; methodology, all authors; software, J.R.M., D.T.R., M.E.O.C. and I.S.E.; validation, J.R.M., D.T.R., M.E.O.C. and I.S.E.; formal analysis, all authors; investigation, J.R.M., D.T.R., M.E.O.C. and I.S.E.; resources, N.D.-G.; data curation, J.R.M., D.T.R., M.E.O.C. and I.S.E.; writing—original draft preparation, all authors; writing—review and editing, all authors; visualization, all authors; supervision, N.D.-G.; project administration, N.D.-G.; funding acquisition, N.D.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This research is part of the Strategic Project Data Science for an Artificial Intelligence Model in Cybersecurity (C073/23), resulting from the collaboration agreement between the Spanish National Cybersecurity Institute (INCIBE) and the University of León. This initiative is carried out within the framework of the Recovery, Transformation, and Resilience Plan, funded by the European Union (Next Generation).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset that is used in this experiment is available in [8].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GenAI | Generative Artificial Intelligence
MT | Machine Translation
LLM | Large Language Model

References

  1. NIST. Generative Artificial Intelligence. Available online: https://csrc.nist.gov/glossary/term/generative_artificial_intelligence (accessed on 11 November 2024).
  2. Sengar, S.S.; Hasan, A.B.; Kumar, S.; Carroll, F. Generative artificial intelligence: A systematic review and applications. Multimed. Tools Appl. 2024. [Google Scholar] [CrossRef]
  3. Cohen, N.D.; Ho, M.; McIntire, D.; Smith, K.; Kho, K.A. A Comparative Analysis of Generative Artificial Intelligence Responses from Leading Chatbots to Questions About Endometriosis. AJOG Glob. Rep. 2025, 5, 100405. [Google Scholar] [CrossRef] [PubMed]
  4. NIST. National Institute of Standards and Technology. 2025. Available online: https://www.nist.gov/ (accessed on 15 January 2025).
  5. INCIBE. INCIBE Vulnerabilities. Available online: https://www.incibe.es/incibe-cert/alerta-temprana/vulnerabilidades (accessed on 28 February 2025).
  6. Lee, T.K. Artificial intelligence and posthumanist translation: ChatGPT versus the translator. Appl. Linguist. Rev. 2024, 15, 2351–2372. [Google Scholar] [CrossRef]
  7. Iorliam, A.; Ingio, J.A. A Comparative Analysis of Generative Artificial Intelligence Tools for Natural Language Processing. J. Comput. Theor. Appl. 2024, 1, 311–325. [Google Scholar] [CrossRef]
  8. INCIBE. Vulnerability Database. Available online: https://drive.google.com/file/d/1tJOP8R98DcEMCNh6whXkqTsXGFpoTYU2/view?usp=sharing (accessed on 28 February 2025).
  9. National Institute of Standards and Technology. Risk Management Guide for Information Technology Systems; Technical Report NIST SP 800-30 Rev. 1; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2012. [Google Scholar]
  10. IBM. Cost of a Data Breach Report 2024. 2024. Available online: https://www.ibm.com/reports/data-breach (accessed on 28 February 2025).
  11. MITRE Corporation. Common Vulnerabilities and Exposures (CVE)—Introduction. 2021. Available online: https://cve.mitre.org/docs/cve-intro-handout.pdf (accessed on 5 February 2025).
  12. Farrell, M. Survey on the Use of Generative Artificial Intelligence by Professional Translators. In Proceedings of the 46th Conference Translating and the Computer, Luxembourg, 18–20 November 2024. [Google Scholar] [CrossRef]
  13. Fu, L.; Liu, L. What are the differences? A comparative study of generative artificial intelligence translation and human translation of scientific texts. Humanit. Soc. Sci. Commun. 2024, 11, 1236. [Google Scholar] [CrossRef]
  14. EAMT. European Association for Machine Translation. 2025. Available online: https://eamt.org (accessed on 16 January 2025).
  15. Hutchins, W.J. Machine Translation: Past, Present, Future; Academic Press: Chichester, UK, 1986. [Google Scholar]
  16. Koehn, P. Statistical Machine Translation; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  17. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  19. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  20. DeepL SE. DeepL. Available online: https://www.deepl.com/es/translator (accessed on 28 February 2025).
  21. Google. Google Translate. Available online: https://translate.google.es/?hl=es&sl=fr&tl=es&op=translate (accessed on 28 February 2025).
  22. Microsoft. Bing Translator. Available online: https://www.bing.com/translator?to=es&setlang=es (accessed on 28 February 2025).
  23. Chapsvision. Systran. Available online: https://www.systransoft.com/es/ (accessed on 17 February 2025).
  24. Softissimo. Reverso. Available online: https://www.reverso.net/traducci%C3%B3n-texto (accessed on 17 February 2025).
  25. OpenAI. ChatGPT. Available online: https://openai.com/chatgpt (accessed on 28 February 2025).
  26. Google. Google Bard (Gemini). Available online: https://gemini.google.com/?hl=es-ES (accessed on 28 February 2025).
  27. Microsoft. Microsoft Copilot. 2023. Available online: https://copilot.microsoft.com/ (accessed on 5 January 2025).
  28. Google. Google Trends. 2006. Available online: https://trends.google.es/trends/ (accessed on 5 January 2025).
  29. Comisión Europea, ETranslation Info. Available online: https://language-tools.ec.europa.eu/documentation (accessed on 28 February 2025).
  30. Comisión Europea, ETranslation API. Available online: https://webgate.ec.europa.eu/etranslation/public/requestApiKey.html (accessed on 28 February 2025).
  31. Pangeanic. Pangeanic. Available online: https://pangeanic.com/es/ (accessed on 17 February 2025).
  32. Chapsvision. Systran API. Available online: https://pangeanic.com/translation-technology/translate-easy (accessed on 28 February 2025).
  33. Softissimo. Reverso API. Available online: https://www.npmjs.com/package/reverso-api (accessed on 28 February 2025).
  34. Ministerio para la Transformación digital y de la función pública, Gobierno de España. ALIA—Artificial Intelligence. Available online: https://www.alia.gob.es/ (accessed on 17 February 2025).
  35. Nidhaloff. DeepTranslator GitHub. Available online: https://github.com/nidhaloff/deep-translator (accessed on 28 February 2025).
  36. Klein, G.; Kim, Y.; Deng, Y.; Senellart, J.; Rush, A. OpenNMT. Available online: https://opennmt.net/ (accessed on 28 February 2025).
  37. vince62s. Eole-nlp. Available online: https://github.com/eole-nlp/eole?tab=readme-ov-file (accessed on 17 February 2025).
  38. Natural Language Processing Group T.U. EOLE. Available online: https://github.com/eole-nlp/eole (accessed on 3 April 2025).
  39. DeepSeek. DeepSeek. 2025. Available online: https://www.deepseek.com/ (accessed on 31 January 2025).
  40. X. Grok. 2023. Available online: https://x.ai/ (accessed on 8 January 2025).
  41. Amazon. Amazon Nova. 2025. Available online: https://aws.amazon.com/es/ai/generative-ai/nova/ (accessed on 8 January 2025).
  42. Gough, D.; Oliver, S.; Thomas, J. An Introduction to Systematic Reviews; SAGE: London, UK, 2012. [Google Scholar]
  43. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002. [Google Scholar] [CrossRef]
  44. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
  45. Popovic, M. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, 17–18 September 2015. [Google Scholar]
  46. Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
  47. Isozaki, H.; Hirao, T.; Duh, K.; Sudoh, K.; Tsukada, H. Automatic Evaluation of Translation Quality for Distant Language Pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, USA, 9–11 October 2010; pp. 944–952. [Google Scholar]
  48. Snover, M.; Dorr, B.; Schwartz, R.; Micciulla, L.; Makhoul, J. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, Cambridge, MA, USA, 8–12 August 2006; pp. 223–231. [Google Scholar]
  49. Lo, C.-k. YiSi—A Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), Florence, Italy, 1–2 August 2019; pp. 507–513. [Google Scholar] [CrossRef]
  50. Rei, R.; Stewart, C.; Farinha, A.C.; Lavie, A. COMET: A Neural Framework for MT Evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 2685–2702. [Google Scholar] [CrossRef]
  51. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. arXiv 2019, arXiv:1904.09675. [Google Scholar]
  52. Gupta, P.; Shekhawat, S.; Kumar, K. Unsupervised Quality Estimation Without Reference Corpus for Subtitle Machine Translation Using Word Embeddings. In Proceedings of the 2019 IEEE 13th International Conference on Semantic Computing (ICSC), Newport Beach, CA, USA, 30 January–1 February 2019; pp. 32–38. [Google Scholar] [CrossRef]
  53. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the Workshop at ICLR, Scottsdale, AZ, USA, 2–4 May 2013. [Google Scholar]
  54. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 1532–1543. [Google Scholar] [CrossRef]
  55. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146. [Google Scholar] [CrossRef]
  56. Schönemann, P. A generalized solution of the orthogonal procrustes problem. Psychometrika 1966, 31, 1–10. [Google Scholar] [CrossRef]
  57. INCIBE. Descriptions of Vulnerabilities in the NVD Database. 2019. Available online: https://elrc-share.eu/repository/browse/descripciones-de-vulnerabilidades-de-la-bbdd-nvd/9dd01c02cfcc11e9913100155d02670670070560807f458f8e50bd5273778d7a/ (accessed on 26 November 2024).
  58. Conneau, A.; Lample, G.; Ranzato, M.A.; Denoyer, L.; Jégou, H. Word Translation Without Parallel Data. arXiv 2017, arXiv:1710.04087. Available online: https://github.com/facebookresearch/MUSE (accessed on 28 February 2025).
  59. SciPy Developers. SciPy: Open Source Scientific Tools for Python. Available online: https://www.scipy.org/ (accessed on 26 November 2024).
  60. Bird, S.; Loper, E. Natural Language Toolkit (NLTK). 2024. Available online: https://github.com/nltk/nltk (accessed on 11 November 2024).
  61. chikiulo. YiSi. Available online: https://github.com/chikiulo/yisi (accessed on 26 September 2024).
  62. Lightning AI. Torchmetrics: Machine Learning Metrics for PyTorch. 2024. Available online: https://lightning.ai/docs/torchmetrics/stable/ (accessed on 11 November 2024).
  63. nttcslab nlp. RIBES. Available online: https://github.com/nttcslab-nlp/RIBES (accessed on 18 November 2024).
  64. Unbabel. COMET. Available online: https://huggingface.co/Unbabel/wmt20-comet-da (accessed on 26 September 2024).
Figure 1. Average and standard deviation of the word embedding similarity for each translation system. X-axis: translators; Y-axis: average similarity.
Figure 2. Total number of wins in the word embeddings for each translation system.
Table 1. Evaluated characteristics of machine translation tools.
No. | Characteristic | Description
1 | Multitasking | Performs multiple tasks (translation, summarization, etc.).
2 | API | Does it have an API? What features does it offer?
3 | Automation | Does it allow automation of input and output?
4 | Input Formats | Formats accepted as input.
5 | Output Formats | Formats generated as output.
6 | Scoring | Does it calculate scores? How does it compute them?
7 | Customization | Can the tool be modified or programmed?
8 | Data Sources | What sources does it use for translation?
9 | Local/Online | Does it function locally or only online?
10 | Proprietor | Owning company or developer.
11 | Technology | Translation technologies used.
12 | Generalist/Specialized | Is it a generalist or specialized tool?
13 | Update Frequency | How frequently is it updated?
Table 2. Summary of machine translation review.
Machine Translation Systems (NMT/SMT)
Google Translate | Neural network-based translator
Bing Translator | Neural network-based translator
DeepL | Proprietary neural translator
Reverso | Translator based on hybrid models
Systran | Translator based on hybrid models
E-Translation | EU machine translation platform
Pangeanic | AI-based specialized translator
THUMT | Open-source neural translation system
OpenNMT | Open-source neural translation framework
Deep-translator | Library for multiple translators
Public Machine Translation Platforms
ALIA | Public machine translation service by the Spanish government
Generative AI Models
ChatGPT-3.5 | Generative language model
Gemini | Google’s generative AI model
Copilot | AI assistant based on generative models
DeepSeek R1 | Large language model by DeepSeek
Amazon Nova | Amazon’s generative AI model family
Grok-3 | Generative AI model by xAI (Elon Musk)
Specialized AI-Powered Tools
Copilot for Security | AI-based security assistant
Table 3. Classification of machine translation evaluation metrics.
Metric | Type | Evaluation | Evaluated Aspects
BLEU | Statistical | Lexical | Based on n-gram precision; penalizes short translations.
GLEU | Statistical | Lexical | Similar to BLEU, but balances precision and recall.
CHRF | Statistical | Lexical | Uses characters instead of words; useful for morphologically rich languages.
METEOR | Statistical | Lexical and Semantic | Uses flexible alignment with synonyms and stemming, balancing precision and recall.
RIBES | Statistical | Word Order | Based on word order; useful for languages with different syntactic structures.
TER | Statistical | Edit Distance | Calculates the number of edits required to transform the translation into the reference.
YiSi | Quantitative | Semantic | Uses semantic representations to measure similarity between translation and reference.
COMET | AI-Based Model | Semantic and Contextual | Neural network-based model that evaluates quality using contextualized embeddings.
BERT | AI-Based Model | Semantic | Measures similarity comparing translation and reference embeddings.
Word Embeddings | AI-Based Model | Semantic | Measures semantic similarity between sentences using vector representations.
Table 4. Information of each vulnerability in the database.
Characteristic | Content
Id | Identifier
CVEId | Common Vulnerabilities and Exposures
Title | Alert title
Description | Description in Spanish
Description (English) | Description in English
Dates | April 2006–6 June 2024
Instances | 205,534
Characters in English | 355.5990 per description (average)
Characters in Spanish | 302.9201 per description (average)
Table 5. Sample information.
Dataset | D
Sample size (instances) | 7514
Total characters in English descriptions | 1,958,403
Average characters per English description | 260.6339
Table 6. Comparison of the selected translators.
Translator | Type of GenAI | Free | Number of Languages
ChatGPT (GPT-4o mini) | Conversational | Yes, with premium options | 85
Bing Translator | Specific for MT | Yes, with premium options | 179
DeepL | Specific for MT | Yes, with premium options | 33
Google Translate | Specific for MT | Yes, with premium options | 243
Microsoft Copilot | Conversational | Yes, with premium options | 42
Table 7. Limitations encountered in the automation of data collection.
Translator | Limitations
ChatGPT | Blocked due to high message volume; limit on conversation length; issues with automatic scrolling (auto-scroll); random tool blocks; MAC address blocking; power supply interruptions
Google Translate and Bing | Character limitation (1,000 for Bing; 5,000 for Google); performance degradation with consecutive requests; incorrect text capture of the translated content
DeepL | Maximum daily translation limit; limit on translation length; persistent service blocks; appearance of advertising pop-ups; disconnections due to power outages; slow processing due to excessive waiting times
Copilot | Maximum of 2,000 characters per prompt; maximum of 10 prompts per conversation; limit on prompts per hour
Table 8. Prompts used for AI-generated translation samples.
GenAI | Prompt
Copilot | “Translate the lines of this text into Spanish independently. Respond only with the translations.”
ChatGPT | “Translate the lines of this text into Spanish independently. Respond only with the translations.”
Table 9. Evaluation metrics.
Metric | Implementation
BLEU [43], GLEU [44], METEOR [46], CHRF [45] | NLTK library [60], available in Python
YiSi [49] | GitHub repository [61] in C++, via the Python subprocess library
TER [48] | Torchmetrics library [62], in Python
RIBES [47] | GitHub repository [63], via the Python subprocess library
COMET [50] | COMET model on Hugging Face [64]
Table 10. Maximum, minimum, and number of victories in the word embedding analysis.
Model | Value | INCIBE | ChatGPT | Bing | DeepL | Google Translate | Copilot
DB (INCIBE) | Maximum | 0.9748 | 0.9748 | 0.9752 | 0.9739 | 0.9753 | 0.9748
DB (INCIBE) | Minimum | 0.7315 | 0.7125 | 0.6992 | 0.7231 | 0.7010 | 0.6957
DB (INCIBE) | Victories | 1058 | 1497 | 480 | 1528 | 1208 | 1743
Corpus (INCIBE) | Maximum | 0.9729 | 0.9735 | 0.9737 | 0.9702 | 0.9727 | 0.9732
Corpus (INCIBE) | Minimum | 0.5891 | 0.5768 | 0.5869 | 0.5635 | 0.5768 | 0.5628
Corpus (INCIBE) | Victories | 1821 | 1297 | 558 | 1031 | 916 | 1891
META | Maximum | 0.9607 | 0.9600 | 0.9588 | 0.9597 | 0.9602 | 0.9626
META | Minimum | 0.7653 | 0.7797 | 0.4900 | 0.4862 | 0.7837 | 0.4111
META | Victories | 1909 | 2064 | 834 | 525 | 913 | 1269
Table 11. p-values obtained from paired-sample tests (Student’s t-test or Wilcoxon test) when INCIBE translations serve as the reference.
Metric | ChatGPT vs. Bing | ChatGPT vs. DeepL | ChatGPT vs. Google Translate | ChatGPT vs. Copilot | Bing vs. DeepL | Bing vs. Google Translate | Bing vs. Copilot | DeepL vs. Google Translate | DeepL vs. Copilot | Google Translate vs. Copilot
BLEU | p < 0.001 | p < 0.001 | p = 0.354 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001
COMET | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p = 0.323 | p < 0.001 | p < 0.001
GLEU | p < 0.001 | p = 0.021 | p = 0.244 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001
RIBES | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001
TER | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001
YiSi | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p = 0.004 | p < 0.001 | p < 0.001
METEOR | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p = 0.240 | p < 0.001 | p < 0.001 | p < 0.001
CHRF | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001
Table 12. Pearson correlation coefficients using the INCIBE translations as reference.
Metric | ChatGPT vs. Bing | ChatGPT vs. DeepL | ChatGPT vs. Google Translate | ChatGPT vs. Copilot | Bing vs. DeepL | Bing vs. Google Translate | Bing vs. Copilot | DeepL vs. Google Translate | DeepL vs. Copilot | Google Translate vs. Copilot
BLEU | 0.730 | 0.657 | 0.743 | 0.584 | 0.681 | 0.762 | 0.503 | 0.714 | 0.437 | 0.525
COMET | 0.831 | 0.772 | 0.854 | 0.549 | 0.811 | 0.898 | 0.505 | 0.840 | 0.471 | 0.515
GLEU | 0.737 | 0.663 | 0.754 | 0.561 | 0.688 | 0.771 | 0.482 | 0.718 | 0.417 | 0.504
RIBES | 0.773 | 0.676 | 0.777 | 0.369 | 0.699 | 0.790 | 0.329 | 0.681 | 0.280 | 0.331
TER | 0.777 | 0.720 | 0.792 | 0.522 | 0.743 | 0.807 | 0.453 | 0.762 | 0.407 | 0.466
YiSi | 0.760 | 0.653 | 0.762 | 0.427 | 0.695 | 0.799 | 0.378 | 0.724 | 0.316 | 0.390
METEOR | 0.792 | 0.689 | 0.791 | 0.450 | 0.714 | 0.818 | 0.388 | 0.719 | 0.332 | 0.396
CHRF | 0.754 | 0.640 | 0.774 | 0.527 | 0.724 | 0.825 | 0.431 | 0.745 | 0.353 | 0.453
Table 13. Count for the consistency test.
Translator | Significant Differences | Total
ChatGPT | 30 | 33
Bing | 31 | 37
DeepL | 31 | 39
Google Translate | 29 | 34
Copilot | 31 | 54
Table 14. Values obtained for each evaluation metric when the reference translation is provided by INCIBE.
Translator | BLEU | COMET | GLEU | RIBES | TER | YiSi | METEOR | CHRF | Mean
ChatGPT | 0.500 | 0.681 | 0.532 | 0.921 | 0.695 | 0.716 | 0.751 | 0.775 | 0.696
Bing | 0.442 | 0.672 | 0.480 | 0.907 | 0.650 | 0.693 | 0.740 | 0.763 | 0.668
DeepL | 0.511 | 0.691 | 0.539 | 0.919 | 0.693 | 0.725 | 0.757 | 0.785 | 0.703
Google Translate | 0.503 | 0.700 | 0.532 | 0.919 | 0.689 | 0.725 | 0.756 | 0.782 | 0.701
Copilot | 0.459 | 0.608 | 0.489 | 0.894 | 0.638 | 0.670 | 0.697 | 0.739 | 0.649
Table 15. Similarity ranking of the translations produced by each translator compared to those provided by INCIBE.
BLEU | COMET | GLEU | RIBES | TER | YiSi | METEOR | CHRF
ChatGPT
Bing
DeepL
Google Translate
Copilot
Table 16. Means of the results obtained by each translator for short and long descriptions.
Translator | Description | BLEU | COMET | GLEU | RIBES | TER | YiSi | METEOR | CHRF
ChatGPT | Short | 0.511 | 0.716 | 0.544 | 0.926 | 0.705 | 0.722 | 0.762 | 0.780
ChatGPT | Long | 0.489 | 0.645 | 0.520 | 0.916 | 0.684 | 0.710 | 0.740 | 0.770
Bing | Short | 0.450 | 0.713 | 0.489 | 0.912 | 0.659 | 0.702 | 0.758 | 0.771
Bing | Long | 0.435 | 0.630 | 0.471 | 0.902 | 0.640 | 0.684 | 0.722 | 0.756
DeepL | Short | 0.527 | 0.733 | 0.554 | 0.924 | 0.705 | 0.735 | 0.770 | 0.793
DeepL | Long | 0.495 | 0.649 | 0.523 | 0.915 | 0.681 | 0.714 | 0.743 | 0.778
Google Translate | Short | 0.524 | 0.739 | 0.552 | 0.926 | 0.706 | 0.739 | 0.774 | 0.792
Google Translate | Long | 0.483 | 0.661 | 0.512 | 0.912 | 0.672 | 0.711 | 0.738 | 0.772
Copilot | Short | 0.470 | 0.644 | 0.501 | 0.898 | 0.647 | 0.676 | 0.707 | 0.742
Copilot | Long | 0.448 | 0.573 | 0.478 | 0.889 | 0.629 | 0.663 | 0.686 | 0.735
Average | Short | 0.496 | 0.709 | 0.528 | 0.917 | 0.684 | 0.715 | 0.754 | 0.776
Average | Long | 0.470 | 0.632 | 0.501 | 0.907 | 0.661 | 0.696 | 0.726 | 0.762
Table 17. Extraction times for the translations from each translation system in seconds.
System | ChatGPT | Bing | DeepL | Google Translate | Copilot
Time/Translation (s) | 5.4055 | 1.4 | 7.0328 | 3.2 | 4.9948
Table 18. Word embedding results with the three different models: DB is the INCIBE database, and Corpus is an INCIBE corpus.
Model | Value | INCIBE | GenAI | Machine Translators
DB (INCIBE) | Mean (X̄) | 0.8390 | 0.8484 | 0.8440
DB (INCIBE) | S | 0.0346 | 0.0364 | 0.0340
DB (INCIBE) | Maximum | 0.9748 | 0.9748 | 0.9753
DB (INCIBE) | Minimum | 0.7315 | 0.6957 | 0.6991
DB (INCIBE) | Victories | 1058 | 3240 | 3216
Corpus (INCIBE) | Mean (X̄) | 0.7746 | 0.7846 | 0.7757
Corpus (INCIBE) | S | 0.0520 | 0.0556 | 0.0519
Corpus (INCIBE) | Maximum | 0.9729 | 0.9735 | 0.9736
Corpus (INCIBE) | Minimum | 0.5891 | 0.5628 | 0.5635
Corpus (INCIBE) | Victories | 1821 | 3188 | 2505
META | Mean (X̄) | 0.9122 | 0.8952 | 0.9104
META | S | 0.0203 | 0.0824 | 0.0255
META | Maximum | 0.9607 | 0.9625 | 0.9601
META | Minimum | 0.7653 | 0.4111 | 0.4861
META | Victories | 1909 | 3333 | 2272
Table 19. Values obtained for each evaluation metric when the reference translation is provided by INCIBE.
Group | BLEU | COMET | GLEU | RIBES | TER | YiSi | METEOR | CHRF | Mean
GenAI | 0.479 | 0.644 | 0.510 | 0.907 | 0.666 | 0.693 | 0.724 | 0.757 | 0.672
Machine translators | 0.485 | 0.687 | 0.517 | 0.915 | 0.677 | 0.714 | 0.751 | 0.776 | 0.690
Table 20. Similarity ranking of the translations produced by GenAI and machine translators compared to those provided by INCIBE.
BLEU | COMET | GLEU | RIBES | TER | YiSi | METEOR | CHRF
GenAI
Machine translators