Explainable Security Requirements Classification Through Transformer Models

Petrillo, Luca; Martinelli, Fabio; Santone, Antonella; Mercaldo, Francesco

doi:10.3390/fi17010015

Open AccessArticle

Explainable Security Requirements Classification Through Transformer Models

¹

Institute for Informatics and Telematics, National Research Council of Italy (CNR), 56124 Pisa, Italy

²

IMT School for Advanced Studies Lucca, 55100 Lucca, Italy

³

Institute for High Performance Computing and Networking, National Research Council of Italy (CNR), 87036 Rende, Italy

⁴

Department of Medicine and Health Sciences “Vincenzo Tiberio”, University of Molise, 86100 Campobasso, Italy

^*

Authors to whom correspondence should be addressed.

Future Internet 2025, 17(1), 15; https://doi.org/10.3390/fi17010015

Submission received: 20 November 2024 / Revised: 23 December 2024 / Accepted: 1 January 2025 / Published: 3 January 2025

(This article belongs to the Special Issue Generative Artificial Intelligence in Smart Societies)

Download

Browse Figures

Versions Notes

Abstract

:

Security and non-security requirements are two critical issues in software development. Classifying requirements is crucial as it aids in recalling security needs during the early stages of development, ultimately leading to enhanced security in the final software solution. However, it remains a challenging task to classify requirements into security and non-security categories automatically. In this work, we propose a novel method for automatically classifying software requirements using transformer models to address these challenges. In this work, we fine-tuned four pre-trained transformers using four datasets (the original one and the three augmented versions). In addition, we employ few-shot learning techniques by leveraging transfer learning models, explicitly utilizing pre-trained architectures. The study demonstrates that these models can effectively classify security requirements with reasonable accuracy, precision, recall, and F1-score, demonstrating that the fine-tuning and SetFit can help smaller models generalize, making them suitable for enhancing security processes in the Software Development Cycle. Finally, we introduced the explainability of fine-tuned models to elucidate how each model extracts and interprets critical information from input sequences through attention visualization heatmaps.

Keywords:

requirements classification; transformers; explainability

1. Introduction

Security requirements are the criteria or constraints that a system, application, or process must meet to protect its data, resources, and users. In addition, the nature of the system and the sensitivity of the data it processes may further determine a specific security requirement. However, the most considerable intent of these requirements is to minimize risks and vulnerabilities that could lead to unauthorized access, data breaches, or any other security-related incident. Non-security requirements, on the other hand, specify what functional and performance criteria the system should meet to accomplish its stated functional goals. They can include requirements about usability, scalability, reliability, and conformity with industry standards. While non-security aspects are essential for the system’s overall effectiveness and user satisfaction, they must be balanced wisely with the security features while designing a robust and resilient architecture. In contrast, non-security requirements address the functional and performance specifications a system must meet to achieve its intended objectives successfully. Some examples of such requirements include usability, scalability, reliability, or conformity with industry-specific standards. Of course, the non-security requirements must be balanced with security considerations in order to develop robust and resilient architecture. This becomes very important as organizations navigate through the complexities of modern-day technology and understand the interplay between security and non-security requirements, as well as develop comprehensive strategies that address protection from potential threats and enhance operational effectiveness and user satisfaction [1].

It is crucial to distinguish between security and non-security requirements among various software specifications. The ability to classify the latter is essential because it is helpful in the recollection of any security requirements in the earlier stages of development, resulting in better security in the final software solution [2]. Additionally, precise classification aids in resource allocation, enabling teams to dedicate appropriate attention and expertise to security and non-security aspects, thereby enhancing the overall quality and reliability of the software. However, the latter are typically articulated in natural language, and this form of representation suffers from some problems [3]. First, requirements can be incomplete, meaning they might not detail all those vital components of the system or project; this incompleteness could lead to ambiguities about what the system can conduct or the project’s scope. Second, requirements can be inconsistent. That is, the different parts of the requirement may, in fact, conflict with each other; hence, the meaning is not clear and is misinterpreted. Another common problem is redundancy. Some requirements repeat the same information, making the document unnecessarily long and increasing the risk of inconsistency. Lastly, requirements can be vague; ambiguous requirements cannot provide a clear understanding and may be interpreted differently by different stakeholders, which may cause misconceptions, resulting in possible gaps in the intended system or project outcome. To address these challenges, we propose a novel method for automatically classifying software requirements into security and non-security categories using transformer models. Our approach focuses on the shortcomings of the operational requirements analysis stage, employing transformer models to make the security requirements better focused and reasonably increase the security measures in software development by understanding language context and semantics. These models’ use is beneficial in this regard, thanks to their capabilities in Artificial Intelligence [4,5,6] and with particular regard to many Natural Language Processing (NLP) tasks [7,8,9]. This approach not only decreases the amount of manual work for requirement classification but also offers an automated classification approach suitable for any range of requirements or project scope. Further, we will describe the methodology used in our technique, reporting the training and validation stages and the outcome of experiments that confirm the usefulness of transformers for the automated classification of requirements, making them suitable for enhancing security processes in the software development cycle. Finally, we introduce explainability to understand how the proposed models perform the classification task and why, given an input sequence. In order to achieve this result, we used heatmaps to plot the attention mechanisms of the models, enabling the visualization of the importance given to each word by them.

The paper proceeds as follows: in the next Section we describe the related work with the current state of the literature, in Section 3 we present the methodology used in this work along with the datasets and models used; the results of the experimental analysis are presented in Section 4, while in Section 5, the latter are discussed. Finally, the conclusions and future research plans are drawn in the last section.

2. Related Work

In this section, we summarize the research on text classification using pre-trained transformer models and automatic classification of software requirements.

2.1. LLMs for Text Classification

In the work by Sun et al. [10], the authors proposed CARP (Clue And Reasoning Prompting) to address these issues by prompting LLMs to find superficial clues and induce a diagnostic reasoning process for final decisions. They describe the standard prompt-based ICL (In-Context Learning) paradigm, transforming text classification into generating predefined textual responses based on prompts. They also explore strategies for sampling demonstrations, such as random sampling, kNN sampling using SimCSE, and fine-tuned models. The authors introduce a progressive reasoning strategy involving clue collection, reasoning, and decision-making and discuss the preparation of clues and reasoning for all training examples in advance for few-shot learning. They evaluate CARP’s performance on five widely used datasets, showing that CARP outperforms existing methods and achieves new state-of-the-art results, demonstrating the approach’s effectiveness in low-resource scenarios and achieving comparable performance with significantly fewer training examples.

In the work [11], authors have fine-tuned various models (BERT family, GPT, ELECTRA, and XLNet) to classify human-written and AI-generated sentences. The models are tested on two datasets: one with 25,000 sentences and another with 22,929 abstracts. The proposed workflow includes training, validation, and an inference phase where sentences generated by ChatGPT and Wikipedia are classified. An explainability phase using integrated gradients and token importance techniques is also conducted. The fine-tuned models achieved high accuracy, precision, recall, and F1-score for the sentence dataset, indicating they effectively learned patterns to distinguish between human and AI-generated sentences. The research underscores the ethical implications of distinguishing AI-generated content from human-generated content. As AI systems become more sophisticated, it is crucial to promote transparency in AI-generated content and establish clear guidelines for attribution. This is essential to maintain trust in information sources and foster discussions about the ethical use of AI.

The study by Kant et al. [12] discusses training mLSTM and Transformer language models on a large 40GB text dataset and their application to text classification problems, precisely binary sentiment, and multidimensional emotion classification. For the emotion classification, they used a dataset of tweets labeled with eight categories plus three additional emotions for training and evaluations. They also created a dataset of relevant tweets to evaluate the models’ performance on domain-specific tasks. The tweets are labeled using human raters and employ an active learning technique to balance class distribution. The study demonstrates that unsupervised pretraining combined with fine-tuning provides a practical framework for complex text classification tasks. The Transformer model, in particular, shows strong performance when fine-tuned for specific tasks with noisy labels and specialized context. This approach offers a practical solution for niche text classification problems, making it accessible for academics and small organizations.

2.2. Automatic Requirements Classification

Dekhtyar et al. [13] employed TensorFlow-guided learning and Word2Vec-based representations for classifying requirements in software engineering documents. The study compares traditional Naïve Bayes methods with convolutional neural networks (CNNs) using random and pre-trained Word2Vec embeddings. The study aimed to evaluate the effectiveness of CNNs and the impact of Word2Vec embeddings on classification accuracy. Naïve Bayes classifiers were built using Scikit-Learn with word count and TF-IDF feature vectors for the baseline. The CNN models were implemented using TensorFlow, with specific configurations for embedding, convolutional, and pooling layers. The results showed that CNNs with Word2Vec embeddings significantly outperformed the Naïve Bayes baseline regarding precision and recall. For the SecReq dataset, the best CNN configuration achieved an F1-score improvement of over 7% compared to the baseline. However, in this work, the Naïve Bayes classifiers used as baselines already perform well on the chosen datasets. While the CNN classifiers with Word2Vec embeddings showed improvements, the extent of these improvements might be less pronounced on different datasets or with other baseline methods. In addition, they do not compare CNNs with other advanced classifiers beyond Naïve Bayes to determine relative performance across different methods. With 100 epochs, this work achieves a maximum F1-score value of 91.34%, while we, with only ten epochs, achieve a maximum F1-score value of 90%. We also conducted an exhaustive comparison of various machine learning models to thoroughly analyze their performance and identify the most effective ones for a given task.

The paper [14] explores the challenges and solutions for identifying security requirements in software development using machine learning (ML) techniques. The study aims to evaluate various ML-based classification algorithms to automate the identification process using the SecReq dataset. The authors propose an empirical study to assess the performance of 22 supervised ML classification algorithms and two deep learning approaches, LSTM and CNN. The study focuses on fast pre-processing techniques, such as word encoding and word embedding, to simplify the classification process without relying on complex linguistic and semantic rules. Results indicate that the LSTM network achieved the highest accuracy (84%) among non-supervised algorithms, while Boosted Ensemble achieved the highest accuracy (80%) among supervised algorithms. The study also analyzes the trade-off between accuracy and training time, highlighting that some classifiers, such as KNN-based algorithms, offer a good balance. In contrast to this previous work, our approach demonstrated a notable improvement in accuracy, reaching a remarkable 90% on the same dataset. We also employed different techniques to handle the dataset imbalance. Combining these strategies improved the model’s ability to generalize to unseen data and ultimately achieve a higher accuracy rate.

The study [15] presents an approach for automatically classifying security requirements into predefined groups using Natural Language Processing (NLP) transformers. The research aims to develop an NLP model capable of achieving at least an 80% F1-score in classifying security requirements by security objectives. The primary models used in this study include BERT, XLNET, and DistilBERT, with evaluation metrics such as precision, recall, and F1-score. The authors performed a binary classification on the SecReq dataset using DistilBERT since it is an efficient model, showing promising results compared to other models. The study demonstrates that deep learning models like DistilBERT can effectively classify security requirements with reasonable accuracy. The study highlights significant challenges in classifying security requirements using NLP transformers due to class imbalance, which, in this study, we tried to overcome using augmentation techniques showing promising results.

The paper by Alhoshan et al. [16] explores the application of Zero-Shot Learning (ZSL) for requirements classification in requirements engineering (RE). The paper proposes using embedding-based unsupervised ZSL for this task. They demonstrate this approach by classifying functional vs. non-functional requirements (FR/NFR), identifying non-functional requirements (NFR) classes, and classifying security vs. non-security requirements. The study shows promising F1 scores for these tasks with zero training efforts. Authors achieve the highest recall value for security requirements of 0.92. Although this work proposes a very interesting approach, as also indicated by the authors, unsupervised learning combined with Zero-Shot Learning (ZSL) demonstrates satisfactory performance in both binary and multi-class classification tasks, it falls short of surpassing supervised classification models.

3. The Method

As explained, in this work we perform a task that aims to classify security and non-security requirements. Figure 1 shows the workflow of the described process. We relied on a labeled dataset composed of a description of both security and non-security requirements in the natural language. Additionally, since the dataset, described deeper in Section 3.1, was imbalanced, we relied on three different data augmentation techniques. Also, to perform the classification task, we fine-tuned four pre-trained transformer models using the four datasets (the original one and the three augmented). Finally, we employed few-shot learning techniques by leveraging transfer learning models, explicitly utilizing pre-trained architectures using the four datasets.

3.1. Dataset

In order to conduct our experiments, we used the publicly available dataset SecReq introduced by Knauss et al. [17]. This work aimed to assist all steps in eliciting security requirements and provide mechanisms to trace security requirements from high-level security statements (security objectives) instead of secure design. This dataset contains 623 requirements, 415 security-related, and 208 non-security-related requirements (Table 1 shows an example per class). It is composed of three industrial specifications: (1) Common Electronic Purse (ePurse), (2) Customer Premises Network (CPN), and (3) Global Platform Specification (GPS). During the dataset analysis, we calculated the length distribution of the requirements, finding (as shown in Figure 2) that, on average, the non-security-related requirements are more extended (with an average number of characters of 183.52) than security-related requirements with an average length of 154.05. Additionally, we conduct a word frequency analysis showing the 20 most frequent words in the requirements. Figure 3 shows that the most used word is the verb “shall”, which, also seeing the example shown in Table 1, indicates a requirement or obligation; it is often used in legal, technical, or formal documents to express that something is mandatory or must be conducted, and for this reason in this context, this word is the most used. The latter verb is followed by “cng”, which is the acronym for “Common Name Group”. It categorizes and groups similar security requirements based on common characteristics or themes. The last word in the figure is “ngn”, which refers to “Next Generation Network”. These advanced telecommunications networks provide a wide range of services, including voice, data, and multimedia.

In order to fine-tune the models, we split the dataset into training, testing, and validation sets using the 80, 10, and 10 criteria. Table 2 shows the number of elements obtained from this split and the category distribution in each set.

3.2. Data Augmentation

As described in the previous section and as visible in Figure 4, the number of requirements for security (166) and non-security categories (331) are unbalanced. This data type could lead to a model that may perform poorly on underrepresented classes. This imbalance can result in biased predictions, as the model favors the majority class. To address this issue, data augmentation techniques can be employed to enhance the representation of minority classes. In this work, we relied on three approaches: word augmentation, back translation, and paraphrasing. Regarding word augmentation, this technique generally consists of generating new random variations of words or phrases in a text. However, in this work, we applied a strategy that relies on a pre-trained BERT model [18] to generate contextual embeddings and add for each requirement three new sentences by randomly adding, removing, or substituting words in the sentences. Using this approach, the model identifies a target word and suggests alternatives based on the surrounding context, ensuring the replacement maintains the original sentence’s semantic integrity. This process enriches the training dataset and helps us mitigate issues related to class imbalance. Instead, back translation is an augmentation technique used in natural language processing (NLP) that involves translating a sentence from the source to a target language and then back to the original language. In this phase, using Google Translate (https://translate.google.com/, accessed on 18 October 2024), we translate each sentence in the training dataset to a random language and translate it back to English. Finally, paraphrasing involves generating alternative versions of a given text while preserving its original meaning. In this process, we used three different Large Language Models to generate three different versions from the starting requirements. The first of these three models is based on the T5 (Text-to-Text Transfer Transformer) [19] architecture, which treats every NLP task as a text-to-text problem. The T5 model is pre-trained on a diverse range of tasks and can generate paraphrases by understanding the context and semantics of the input text. It excels in generating coherent and contextually relevant paraphrases. The second involved model is GPT-Neo [20], an open-source alternative to OpenAI’s GPT-3 and, with its 2.7 billion parameter version, can generate human-like text based on the input it receives. It can perform various tasks, including text generation, completion, and paraphrasing. Its large size allows it to capture complex language patterns, making it practical for generating diverse and creative paraphrases. Finally, Pegasus [21] is a model specifically designed for text generation tasks, particularly summarization and paraphrasing. It uses a transformer architecture and is pre-trained on a large text corpus. The variant used in this work is fine-tuned for paraphrasing tasks, allowing it to effectively produce paraphrases that maintain the original meaning while varying the wording and structure. We submitted each requirement to each model, indicating to the model to perform the paraphrasing task and adding each result to the new dataset. Table 3 shows an example of an original dataset requirement and the related variants generated using the abovementioned techniques. In contrast, Figure 5 shows the dataset size after each of the three processes that allowed us to rely on a more balanced dataset.

3.3. Models

This section describes transformers and the advantages of their fine-tuning task. In addition, we provide a brief description of the models used in this work.

3.3.1. Fine-Tuning

Transformer models are advanced artificial intelligence systems [4,22,23,24] designed to understand, generate, and manipulate human language. They are built on deep learning architectures, especially transformer models, for text data collection and analysis. Transformers are particularly popular since they quickly execute various tasks, such as text classification, sentiment analysis, translation, text summarization, and many others. Transformers can be particularly powerful in text classification tasks because they model the contextual relations given words or phrases much better than traditional approaches. This is mainly because knowledge is derived from a lot of diverse text data, which makes these models adapt very quickly to new tasks with only a slight modification. Fine-tuning them means adapting them to specific domains and enhancing their performance on relevant tasks and terminology. This process improves accuracy and relevance in predictions and responses while also helping to reduce biases by exposing the model to more diverse and representative data. In this work, we fine-tuned four models: BERT, DistilBERT, RoBERTa, and XLNET.

BERT (Bidirectional Encoder Representations from Transformers) [18] is a revolutionary model that popularized the use of the bidirectional training of transformers. Specifically, in contrast to constitutive models, which tackle the text in one direction only (either left-to-right or right-to-left), BERT takes into account both contexts at once. This leads to a more comprehensive grasp of the languages and interconnections between words. Indeed, BERT is trained on a large text corpus using two tasks, masked language modeling and next-sentence prediction. The model is now a state-of-the-art language representation model and was successfully applied to different text classification tasks.

DistilBERT [25] is a smaller, faster, and lighter version of BERT, created via a process called knowledge distillation. It retains 97% of BERT’s language understanding capabilities, yet it is 60% faster and requires less memory to run. This makes DistilBERT a very enticing option when working with applications with scarce computation resources or if a real-time response is necessary. While smaller, DistilBERT performs competitively on many NLP tasks, including text classification.

RoBERTa (A Robustly Optimized BERT Pretraining Approach) [26] represents an advance from BERT. Researchers at Facebook AI developed it to improve the capabilities of the BERT model by modifying its architecture and training methodology. This was conducted by removing the next sentence prediction objective and training on a larger dataset with longer sequences. It also uses dynamic masking during training, making learning contextual representation possible. These further improvements enable RoBERTa to establish state-of-the-art performance on many benchmarks, including GLUE, RACE, and SQuAD, as well as substantial improvements in text classification tasks. By better capturing context and relationships within the text, RoBERTa improves accuracy and effectiveness in categorizing and understanding complex textual data, outperforming its predecessor, BERT.

XLNet [27] is a pre-trained model that effectively incorporates the strengths of BERT and autoregressive models like GPT by leveraging bidirectional context and permutation of input sequences. This allows XLNet to capture all possible word orders, which provides a much finer understanding of word relationships in flexible ways. This flexibility, in particular, makes the model capture the meaning and contextual dependencies much better when working with text classification tasks, providing much higher accuracy and performance in text categorization.

3.3.2. Few-Shot Learning

Additionally, in this work, we relied on few-shot learning. It is a machine-learning paradigm where only a few examples of a specific class are required for the model to understand a task correctly. This is particularly helpful when creating massive datasets that are not feasible. Recent advances in LLMs have incorporated this practice by training a model on the bulk of the text and making it capable of performing various tasks based on limited input. Few-shot learning is designed to have models learn and generalize from just a few examples. Thus, it is a helpful strategy in a multitude of scenarios, especially in NLP and computer vision [28]. Few-shot fine-tuning is almost always initiated with an off-the-shelf pre-trained model, usually a transformer language model (BERT, GPT, DistilBERT). In this respect, it is in sharp contrast with classical modes of fine-tuning, which usually assume the availability of large-sized labeled datasets to reach acceptable performance. This is achieved by leveraging the knowledge encoded in the pre-trained model, allowing it to make informed predictions based on the few examples provided.

Specifically, in this work, we relied on SetFit [29], a framework that aims to improve few-shot learning for text classification tasks by combining the strength of fine-tuning and pre-trained large language models. The framework operates on a two-part approach: first, it encodes text using a pre-trained language model; second, it creates embeddings for the text and the set of labels for the task of interest. The embeddings are trained on a limited number of labeled samples for a specific purpose to allow the model to capture the class separation clarity needed for the classification. It leverages contrastive learning, a technique where the model is trained to distinguish between similar and dissimilar examples in the embedding space by making subtle distinctions between different classes based on how they relate. At the same time, other approaches may use many more labels without guaranteeing good performance. This is very efficient in practice, considering many practical scenarios have limited labeled data. SetFit uses pre-trained language models combined with minimal forecasting to alleviate the usual data requirements and computing costs needed to train LLMs while achieving excellent results for classification tasks such as sentiment exploitation [30] and topic categorization [31].

3.4. Explainability

As AI systems reach a higher degree of sophistication, concern has been raised about how their internal processes work. To address this, the idea of explainable AI is gaining importance. AI explainability refers to the methods and techniques used to make the operations and outputs of AI systems understandable to humans. It aims to explain the inferences made by AI models; the advancement enhances trust, accountability, and transparency. Large Language Models, on the other hand, are trained on immensely scaled datasets composed of complex neural architectures and have the ability to comprehend and produce human languages efficiently. Explainability, however, is challenged by the complexity of their architecture and the volume of data they work with. There are several important considerations as to why it is vital to comprehend how transformers operate. The first point pertains to the user’s trust in the models and their outputs. The second point suggests that the model can assist in detecting and correcting the imbalance in the training data, thus preventing any biases.

In this work, we used attention visualization, which is a method to examine the behavior of attention mechanisms located in neural networks, especially in models such as Transformers and LLMs. Attention mechanisms allow models to weigh the importance of different input elements when making predictions or generating outputs. The attention enables models to focus on some words or tokens within the input sequence being processed to generate the response or make the prediction. These mechanisms can be described graphically and are valuable for understanding the various operations within the system by detailing the parts or areas of the input that are acceptable for a given task. The most common illustrations employed when demonstrating attention focus include heat maps, which depict the level of attention focused on individual input tokens. For example, certain words may be highlighted in a sentence to indicate that they received more attention from the model when predicting the next word or generating a response.

4. Results

In this section, we present the results obtained from the different models employed and described in Section 3.3.1 based on the different versions of the dataset augmented using the techniques explained in Section 3.2. In order to perform the fine-tuning and the few-shot learning training, we implemented a custom framework that trains a selected model based on the given dataset, which, in this case, is the one described previously. We used the hyperparameters shown in Table 4 to train the models. We evaluated each model on the test set, considering a positive class the sentences labeled as security requirements. Considering these labeled sentences as the positive class, we could measure the models’ performance in recognizing and distinguishing security-related information from other text types. This evaluation approach allowed us to gain insights into the strengths and weaknesses of each model and on each dataset.

4.1. Fine-Tuning Results

Table 4 summarizes the hyperparameters used to fine-tune the models across the four datasets employed and based on the different augmentation techniques. We use the same hyperparameters to ensure that any differences in the model’s performance can be attributed to the data rather than variations in the training process. Learning rate, batch size, and number of epochs can significantly affect model performance, and then using the same can also make it easier to reproduce results. For this reason, we fine-tuned the models in all the experiments using ten epochs with an initial learning rate of

2 e - 5

.

Table 5 reports the models’ accuracy, precision, recall, and f1-score across a standard dataset and three augmented variations created using contextual word embeddings, back translation, and paraphrasing techniques. These augmentation methods aim to introduce variation to the dataset, potentially enhancing model generalization for security and non-security requirement classification tasks.

Starting with the normal dataset, it can be seen that both BERT and DistilBERT attain high accuracy scores of 0.89, with XLNet and RoBERTa coming in second and third, respectively, at 0.90. This consistency shows that all models perform well even when data augmentation is not used, indicating a solid baseline performance. BERT and DistilBERT balance precision and recall, but DistilBERT performs marginally better in recall (0.86 vs. 0.76), indicating that it could catch more pertinent examples. However, the RoBERTa model achieves higher scores in all the calculated metrics, suggesting it effectively captures the correct classifications when working with the base dataset. It is possible thanks to RoBERTa’s more modern architecture, which improves attention mechanisms over classic BERT. Regarding the dataset obtained using contextual word augmentation, BERT retains great precision and recall while achieving a higher accuracy (0.94) than the original. This improvement shows how flexible BERT is to small contextual and lexical changes brought about by embedding-based augmentation. However, RoBERTa drops significantly regarding the recall (0.57). It could suggest that the model may be sensitive to embedding augmentation, which could interfere with its internal representations to classify data. XLNet, in contrast, continues to perform well, suggesting that its bidirectionality may enable it to incorporate enriched settings more effectively than RoBERTa. In the context of the dataset obtained through the back translation process, again, BERT and DistilBERT exhibit great precision and recall (scoring 0.96 in recall), and both achieve an accuracy of 0.92. This implies that the dataset is successfully diversified by back translation while retaining the essential semantic information required for these models to function correctly. Nevertheless, XLNet maintains its stability with a strong recall of 0.90 and an F1-score of 0.86, whereas RoBERTa has another performance decline, obtaining a lower recall (0.62). Finally, BERT and DistilBERT continue to perform well on the dataset that contains paraphrased sentences. This implies that both BERT-based models successfully reflect the variety in phrasing brought forth by paraphrasing techniques. RoBERTa may have trouble with the increased lexical diversity that paraphrasing introduces, as evidenced by its lower recall (0.71). This can be because while RoBERTa excels at token-level contextual relationships, paraphrasing disrupts sentence structures and alters key terms, introducing linguistic variability that the model struggles to generalize. It performs strongly on datasets that maintain consistent syntax and vocabulary, as its masked language modeling objective allows it to capture relationships between tokens in highly contextualized settings. However, paraphrasing introduces variations that RoBERTa may not generalize well if its attention has been overfitted to specific patterns during fine-tuning. With an accuracy of 0.89 and a balance between precision and recall, XLNet does rather well, showing that it can handle paraphrasing more effectively than RoBERTa but with more variability than BERT.

Overall, BERT and DistilBERT exhibit steady flexibility across a range of datasets, with DistilBERT proving effective even in more challenging situations brought about by augmented data. Even while RoBERTa is effective in the standard dataset, it exhibits considerable sensitivity to augmentation, which may indicate that more fine-tuning is required to maximize its utility with extremely diverse datasets. XLNet’s performance is competitive and stable but shows a different augmentation resilience than BERT, mainly when dealing with higher lexical diversity.

Impact of Dataset Augmentation on Model Performance

To evaluate the impact of dataset augmentation on the performance, we created the confusion matrix of the BERT model in the standard dataset and the contextual word augmentation and compared them. As can be seen in Figure 6a, in this matrix, the model correctly identified 39 non-security requirements (True Negatives) and 16 security requirements (True Positives). However, it misclassified two non-security requirements as security (False Positives) and failed to identify five security requirements (False Negatives). In contrast, the confusion matrix (Figure 6b) for the model trained on the augmented dataset maintained the same True Negatives (39) and False Positives (2) as the original model, indicating consistent performance in identifying non-security requirements. The augmented model significantly reduced the False Negatives to 2 and increased the True Positives to 19. This improvement suggests that the augmentation process enhanced the model’s ability to classify security requirements accurately. Overall, the results indicate that contextual word embeddings for dataset augmentation positively influenced the model’s performance, leading to a more effective classification of security-related requirements.

4.2. Few-Shot Learning Results

Given the size of the dataset, we used a technique called few-shot learning, described in Section 3.3.2, that enables models to learn how to learn and adapt quickly to new tasks based on a small number of labeled examples. As a result, the models become more flexible and adaptable. Based on the results described in Section 4.1, we selected the model with the best performance considering the average metrics. Specifically, considering the Table 5, for the normal dataset, we selected the RoBERTa-base model; regarding the dataset augmented using contextual word embeddings and the other augmented using the back translation technique, we used BERT-base-uncased. Lastly, regarding the dataset augmented using the paraphrasing technique, we used DistilBERT-base-uncased. After this phase, we trained the models using transfer learning-based methods that adapt the selected pre-trained models to learn from a few data to classify the requirements. We trained each model for the specified dataset for two epochs, reaching the results shown in Table 6.

Based on its strong performance in the previous testing, RoBERTa was chosen for the normal dataset. Its performance in the few-shot scenario, where it attained an exceptional F1-score of 0.97, accuracy of 0.92, and precision of 0.94, confirms this selection. In particular, the strong F1-score highlights RoBERTa’s ability to handle unaugmented data, indicating that its contextual handling fits in well with the original dataset structure and enables it to perform very well in correctly categorizing security requirements.

BERT was selected for the contextual word embeddings augmented dataset because previous studies consistently demonstrated its ability to navigate contextually modified data. BERT leads the few-shot outcomes with an accuracy of 0.94 and consistent scores for precision, recall, and F1-score at 0.90. These findings demonstrate that BERT’s design can adjust to subtle rephrasings while preserving semantic integrity, successfully capturing the little but significant lexical changes from contextual embeddings.

BERT manages the various rephrasings brought about by back translation with only a minor memory loss, as seen by its accuracy of 0.90 and F1-score of 0.86. This could result from some sensitivity to the more extensive syntactic alterations in the translated text. However, BERT’s ability to maintain excellent performance in back-translated augmentations and contextual embeddings indicates that it can adjust to subtle and more noticeable rephrasings.

Finally, DistilBERT performed well in few-shot learning, achieving an accuracy of 0.90, precision of 0.89, and F1-score of 0.85, despite the greater lexical diversity of paraphrased lines. DistilBERT may overlook some pertinent examples in this situation, as evidenced by its poorer recall compared to precision, which points to a more selective performance that prioritizes categorization accuracy. Although additional fine-tuning may enhance recollection for substantially paraphrased inputs, this result confirms DistilBERT as a reliable option for managing datasets with varying phrasings.

When considering the training duration, the contrast between few-shot learning and comprehensive fine-tuning becomes significant. The few-shot learning strategy demonstrates a considerable improvement in efficiency, which requires only two training epochs. Nevertheless, it achieves performance levels comparable to or nearly completely fine-tuned models trained during ten epochs. In addition to cutting down on training time and computing expenses, few-shot learning demonstrates how well the models can identify critical patterns in the data with little exposure.

4.3. Explainability Results

Important information about how these fine-tuned transformers understand security requirement texts can be found in the attention visualization heatmaps produced for every model and dataset combination. We can better understand how each model extracts and interprets essential information from the input sequences by examining the attention patterns across models and datasets. The latter involves an analysis to identify key factors in models that influence their predictions to make models transparent and reliable. This enables organizations to understand the reasoning behind a model’s predictions, ensuring trust in the results. In order to create the attention visualization heatmaps, we randomly selected security and non-security requirements text, shown in Table 7, and submitted them to the dataset’s best model. We extract each token’s input IDs and attention weights from each sentence. Then, we converted each token back into the human-readable format and plotted the results. After this phase, we plot the attention visualization of the last layer of the model.

Concerning the so-called normal dataset (Figure 7), RoBERTa was the choice selected for analysis. The heatmap of RoBERTa attention shows rather diverse attention, but of note were the words “security”, “measures”, and “provider”. Although many different tokens need attention, RoBERTa allocates the necessary focus toward all significant tokens, explaining its effectiveness on non-augmented data. These words’ locating attention suggests that RoBERTa’s contextual understanding incorporates these key security requirement concepts whenever they contextualize the intent and range of the requirement meaning itself. The balanced attention facing these words is probably one of the reasons why RoBERTa achieved high accuracy and precision results on this dataset, which does not require any accommodation of augmented or modified contexts.

Within the dataset on word augmentation, the BERT model was selected, and further confirmed through the corresponding attention heatmap (Figure 8) that focuses more on several keywords: “internal”, “security”, and “required”. In this case, the attention is exploited more towards those words, which are supposed to be the target words of a security requirement itself, ignoring the contextual words more than RoBERTa does. The increased focus on those terms of interest also matches the word-level augmentations presented in the word augmentation from this dataset. At the same time, as BERT focuses on the core purpose of the security requirement, the noise in the word-augmented dataset does not adversely affect it, as BERT ignores the additional changes introduced by word augmentation techniques. Such a focus attention strategy might be why BERT obtains high accuracy and F1 scores, which had the word-level augmentation.

Similarly, BERT was selected for the back translation dataset. The heatmap in Figure 9 reveals a more stable focus on words with meanings such as “inside” and “security”. However, compared to Figure 8, the attention distribution in this one is slightly more dispersed across a broader range of terms. Such a difference suggests that BERT looks at the data slightly differently depending on the augmentation in the dataset. In this heatmap, BERT seeks to give equal attention to the most significant information and the context in translation that may cause a change in the structure. Most importantly, such a network can remain relevant while losing a few pertinent features, such as translation. BERT’s attention dispersion, which is most likely able to alter classification and generalize, proves to be why BERT performs exceptionally well with a strong recall and F1 score on this dataset.

Lastly, for the paraphrased dataset, the DistilBERT attention heatmap (Figure 10) shows important information about how the model understands security requirements. It is evident from this heatmap that DistilBERT gives specific attention to phrases such as “applications”, “perform”, and “internal”. This model acknowledges the significance of these phrases in communicating the primary intent, as they are essential to the security need. In addition, since on average this model performs better than the other models, we compare the attention heatmaps of three different security and three non-security requirements to understand the choices made by this model to perform the predictions. The attention heatmap in Figure 11 highlights distinct patterns where specific key terms draw more focus. Indeed, the terms “responsible” and “generating” have high attention, while “generating” receives the most attention. That would mean the model focuses on action-oriented terms, which are crucial to the meaning of the security requirement. The model focuses on words that are semantically responsible for carrying the meaning of responsibility and operational tasks. Further, “loading” obtains an appreciable degree of attention; it is apparently because it describes the action with “generating”. In the last security requirement, as represented in Figure 12, “applications”, “expose”, and “only” are words that attract the highest attention values. The word “applications” has consistent emphasis, especially at the beginning of the sentence, which will likely reflect its role as a main subject. The term “expose” captures significant attention, suggesting that the model correctly focuses on the key action described. The adverb “only” similarly attracts robust attention, showing the crucial restricting of the exposure scope regarding the meaning of the requirement.

Considering the non-security requirements outlined in Table 7, Figure 13 depicts that the terms “authentication,” “routing,” and “connectivity” are relatively more prominent in the sentence. This is expected because these terms carry technical connotations. On the other hand, no word or term seems to have the highest concentration, so the attention is still spread across the sentence. For example, terms like “authentication” do not receive sufficient attention or relevance to security. Such a distribution reinforces the classification as a non-security requirement as no vital security-related keywords are identified. Instead, for the second sentence, the attention map (Figure 14) reflects the model’s understanding that this sentence relates to quality of service (QoS) and user experience enhancement rather than security concerns. The absence of words like “security”, “access control”, or “encryption” ensures a non-security classification, showing that the model can handle ambiguous terms like “functions” and “network” without misclassifying them as security-related. In the last sentence in Figure 15, attention focuses marginally on “storage” and “UGC” (user-generated content), as these terms imply functionality. However, the overall attention remains weak and scattered. This focus suggests that the model recognizes functional relevance but does not misinterpret these as security-critical elements.

5. Discussion

Information concerning model choice and augmentation techniques and overall classification effectiveness in security and non-security requirements were gained from the experiments conducted in this work, which involved the full fine-tuning and the SetFit optimization approaches. Augmentation techniques were substantial in improving the performance of the models. The baseline of the original dataset confirmed that RoBERTa and BERT models could achieve high accuracy without augmentation. One appealing feature of RoBERTa was that it outperformed its counterpart, meaning the model correctly understands the context in most raw data. On the other hand, the benefits of augmentation become more pronounced when examining the modified datasets. Word replacement augmentation altered meanings lexically at a quite minimal level by substituting, adding, or eliminating words using contextual embeddings. Following and implementing this technique improved BERT’s performance drastically, mainly after SetFit had been applied, where high accuracy, precision, and recall had been obtained. With its capacity for strong token-level representations, it seems to hint that BERT can cope with variations at the word level and thus can take advantage of this augmentation’s robustness. Word level augmentation seems to improve the ability of the model to generalize by providing it with minor language alterations that are crucial, allowing the model to recognize requirements across slightly modified scenarios. On the contrary, back translation involves translating sentences to another language and back, which tends to introduce syntactic changes without altering the original meaning. The slightly lower scores mean that although BERT can handle syntactic shifts in a different language or structure, this type of alteration is more difficult for him than word-level changes. This indicates that sentence-level alterations may add another complexity layer, necessitating more standardized representations. Nonetheless, BERT’s solid performance with back-translation highlights its adaptability to sentence rephrasings that do not fundamentally alter meaning.

The dataset obtained through paraphrasing using pre-trained LLMs has posed a different difficulty as it requires restructuring a sentence without changing its meaning. In this particular instance, DistilBERT was accurate, although its recall was inferior, which suggested that it had greater difficulty comprehending all pertinent security requirements under strong restatement. This pattern suggests that while DistilBERT can deal with a certain level of semantic change, it is possible that heavy changes are not best suited for it due to its architecture size. Using SetFit optimization, DistilBERT performed relatively high in accuracy and precision, confirming that the combination of fine-tuning and SetFit can help smaller models generalize across varied paraphrases, even though recall remains a slight concern.

Regarding explainability, the attention heatmaps of the final layer for each model and dataset show the explainability findings, which provide light on how these models prioritize and interpret data from various augmentation methods. Thanks to each heatmap’s visual breakdown of word-level attention, we can examine how each model interprets and distinguishes security-related phrases in various input settings.

RoBERTa showed a focused attention pattern on important security-related phrases, such as “security” and “measures”, for the normal dataset, suggesting that it correctly concentrates on the terms that best reflect security needs. RoBERTa’s excellent accuracy and F1 scores on the NR dataset are consistent with the attention heatmap’s robust and focused attention.

BERT’s attention heatmap displays a more scattered pattern in the word replacement dataset, illustrating how the model adapts to small linguistic changes brought about by word additions, substitutions, or deletions. This model spreads attention to nearby keywords to offset the increased lexical diversity while giving security-relevant words more weight. In line with its impressive results in the WR dataset, this distribution demonstrates the model’s versatility and capacity to extract contextual information in augmented environments. The fact that BERT focuses on related phrases indicates that it gains more comprehensive knowledge and becomes more robust to word-level disruptions.

Because back-translation adds grammatical alterations and rephrased sentence structures, BERT’s attention heatmap becomes even more distributed with the back-translation dataset. In order to comprehend content inside the modified sentence structures, BERT compensates for these syntactic modifications by capturing several words around the important security-related keywords, as seen by its attention distribution. This broader focus is consistent with the model’s somewhat lower scores than the augmented dataset, suggesting that although BERT can sustain performance, back-translation is more complex than more straightforward word-level augmentations. According to the heatmap, BERT continues identifying pertinent phrases while maintaining semantic integrity by distributing them more widely across context words.

Compared to the earlier datasets, DistilBERT displays a more scattered attention pattern for the paraphrase dataset. The model finds it more challenging to concentrate on particular terms when paraphrasing since it creates significant alterations in sentence structure. DistilBERT responds by focusing on many words inside each sentence, indicating that it aims to capture the context as a whole rather than simply essential phrases. Given that it could be difficult to recognize all pertinent security phrases in highly rephrased sentences, this diffusion is consistent with DistilBERT’s reduced recall in this dataset. The broader attention pattern suggests a cautious approach to capturing meaning, which leads to a trade-off where some pertinent phrases may be missed, even while the model retains high accuracy.

Limitations

The quantitative findings and the explainability study show the shortcomings of current methods for classifying security requirements, emphasizing significant issues with model performance, data augmentation, and interpretability. First, managing the language variety brought forth by data augmentation presents a challenge. Although augmentation methods such as contextual word embeddings, back-translation, and paraphrasing might improve model robustness and boost data diversity, they can also contribute irrelevant information or noise. For example, paraphrasing and back-translation frequently drastically change sentence structure, which may weaken the meaning of crucial terms or diminish language about security. This may result in models misinterpreting important terms on supplemented datasets, showing up as somewhat poorer recall and F1 score. As seen by DistilBERT’s performance on the paraphrased dataset, this heterogeneity makes it difficult for the models to maintain high accuracy across various language expressions, particularly when paraphrasing. Besides, even training these models over a small dataset requires selecting hyperparameters carefully, and the time for training them is significantly high. SetFit optimization, thanks to its efficiency, addresses these challenges of model fitting quite well. However, some issues remain particularly related to overfitting on small training sets or overfitting on augmentation. Models trained over a small dataset tend to memorize the training data too well and not generalize well to new unseen data, not in the training set. This poses severe threats in the case of security requirement classification, where the model is expected to perform well on various iterations in language structure and terminologies.

6. Conclusions and Future Work

This research presented a method to classify security and non-security requirements. This identification enables development teams to allocate resources and concentrate their efforts efficiently. A clear separation makes it easier to guarantee that these crucial areas are sufficiently addressed, as security needs frequently call for specialist expertise and techniques. Additionally, organizations may better analyze and reduce risks by establishing security needs often tied to possible vulnerabilities. This is especially crucial for adhering to industry norms and laws, which frequently call for specialized security precautions. Finally, security requirements can evolve due to changes in threats or regulations. Organizations may more easily monitor and adjust these criteria throughout the software lifetime and ensure they stay applicable and efficient by keeping a distinct categorization. This categorization improves software systems’ dependability and security. In order to carry out this work, we used an original labeled dataset of security and non-security requirements to fine-tune four different transformers that are BERT, DistilBERT, RoBERTa, and XLNET, achieving a maximum accuracy of 0.9, precision of 0.86, recall of 086 and f1-score of 0.86 with the RoBERTa model. Since the dataset classes are imbalanced, we used a data augmentation technique that generates contextual word embeddings to insert, substitute, or remove a word in the original requirement to generate alternatives and fine-tune the presented models. Additionally, we relied on the well-known back translation to generate alternatives to a requirement. Lastly, we introduce another technique to handle class imbalance using an augmentation technique consisting of three fine-tuned LLMs for paraphrasing tasks to generate coherent paraphrases of the minority class of the requirements. Using these techniques, we perform best by fine-tuning the models on the augmented dataset using contextual word embeddings, obtaining an accuracy value of 0.94, precision of 0.9, recall of 0.9, and f1-score of 0.9.

In addition, we relied on SetFit, a framework that aims to improve few-shot learning for text classification tasks. Based on the models’ performances previously described, we select for each created dataset the best model and use for the few-shot task, obtaining a maximum value of accuracy by using BERT on the augmented dataset with contextual word embeddings, obtaining an accuracy value of 0.94, precision of 0.9, recall of 0.9, and f1-score of 0.9. The results highlight how crucial model selection is for particular dataset attributes. When data are left unchanged, RoBERTa’s robust contextual representations are demonstrated by its improved performance on the normal dataset. BERT is perfect for jobs with augmented inputs because of its performance with word augmentation and back translation data, demonstrating its resilience to word and sentence changes. The success of DistilBERT with the paraphrased data indicates that, although smaller models can manage rephrasings with SetFit’s help, they could have a little trouble remembering things when there are significant semantic changes. Also, augmentation and SetFit together show promise to improve model robustness in jobs involving requirement classification. While sentence-level augmentations are proper but add complexity, word-level augmentations provide the most reliable generalization gains, especially for BERT.

Finally, by random sampling a security requirement from the dataset, we use explainability to gather explanations for the decisions generated by fine-tuned models from the four different datasets. In order to achieve these results, we created heatmaps based on the attention mechanisms used by the models to understand which words or subwords the models pay more attention to in classifying the requirements. The results show that various augmentation methods produce unique attention patterns in each model, which reflects their interpretative approaches and constraints. While BERT’s flexible attention to the word augmented and back-translated datasets demonstrates its resilience to lexical and syntactic alterations, RoBERTa’s concentrated attention on the NR dataset implies it performs well with unchanged text. However, DistilBERT’s scattered focus in the paraphrased dataset shows its limitations in managing sophisticated paraphrasing. These heatmaps highlight the significance of model interpretability for comprehending how models adjust to augmented datasets. By visualizing attention, we can better understand why some models perform well under particular circumstances and how they react to different augmentation. This interpretability is crucial for requirement classification because it highlights each model’s advantages in managing complex security-related terminology, increasing openness, and fostering confidence in the models’ classification decisions.

In future works, we will complete a hyperparameter tuning stage to improve our models’ performance. This stage involves varying the parameters that define the various machine-learning algorithms to improve their accuracy and effectiveness. By adjusting these hyperparameters, we intend to enhance the models’ learning processes, increasing their predictive power while decreasing their overfitting tendencies. This will also entail reaching performances based on the models’ results in the precision, recall, F1 score, and mean squared error, among others, to outline the shortcomings. Additionally, we intend to use different models, also known as small language models, that may adapt and deal with small datasets better. Small language models are designed to be more efficient and effective when working with limited data; hence, they are ideal for tasks when large datasets are unavailable. If applied properly, these models enhance the results and make them more accurate or reliable, even in smaller datasets. This is highly desirable when collecting or generating large datasets that are unwarranted or impossible, such as in some research applications or when dealing with sensitive or proprietary data.

Author Contributions

Conceptualization, L.P., F.M. (Fabio Martinelli), A.S. and F.M. (Francesco Mercaldo); Methodology, L.P., F.M. (Fabio Martinelli), A.S. and F.M. (Francesco Mercaldo); Software, L.P. and F.M. (Francesco Mercaldo); Investigation, L.P., F.M. (Fabio Martinelli), A.S. and F.M. (Francesco Mercaldo); Resources, F.M. (Fabio Martinelli), A.S. and F.M. (Francesco Mercaldo); Data curation, L.P., F.M. (Fabio Martinelli), A.S. and F.M. (Francesco Mercaldo); Writing—original draft, L.P., F.M. (Fabio Martinelli), A.S. and F.M. (Francesco Mercaldo); Writing—review & editing, L.P., F.M. (Fabio Martinelli), A.S. and F.M. (Francesco Mercaldo); Supervision, F.M. (Francesco Mercaldo); Project administration, F.M. (Fabio Martinelli), A.S. and F.M. (Francesco Mercaldo); Funding acquisition, F.M. (Fabio Martinelli). All authors have read and agreed to the published version of the manuscript.

Funding

This work has been partially supported by EU DUCA, EU CyberSecPro, SYNAPSE, PTR 22-24 P2.01 (Cybersecurity) and SERICS (PE00000014) under the MUR National Recovery and Resilience Plan funded by the EU—NextGenerationEU projects, by MUR—REASONING: foRmal mEthods for computAtional analySis for diagnOsis and progNosis in imagING—PRIN, e-DAI (Digital ecosystem for integrated analysis of heterogeneous health data related to high-impact diseases: innovative model of care and research), Health Operational Plan, FSC 2014-2020, PRIN-MUR-Ministry of Health, the National Plan for NRRP Complementary Investments D^3 4 Health: Digital Driven Diagnostics, prognostics and therapeutics for sustainable Health care, Progetto MolisCTe, Ministero delle Imprese e del Made in Italy, Italy, CUP: D33B22000060001, FORESEEN: FORmal mEthodS for attack dEtEction in autonomous driviNg systems CUP N.P2022WYAEW and ALOHA: a framework for monitoring the physical and psychological health status of theWorker through Object detection and federated machine learning, Call for Collaborative Research BRiC-2024, INAIL.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Pasquale, L.; Spoletini, P.; Salehie, M.; Cavallaro, L.; Nuseibeh, B. Automating trade-off analysis of security requirements. Requir. Eng. 2016, 21, 481–504. [Google Scholar] [CrossRef]
Khan, R.A.; Khan, S.U.; Khan, H.U.; Ilyas, M. Systematic mapping study on security approaches in secure software engineering. IEEE Access 2021, 9, 19139–19160. [Google Scholar] [CrossRef]
Femmer, H.; Fernández, D.M.; Wagner, S.; Eder, S. Rapid quality assurance with requirements smells. J. Syst. Softw. 2017, 123, 190–213. [Google Scholar] [CrossRef]
Huang, P.; Li, C.; He, P.; Xiao, H.; Ping, Y.; Feng, P.; Tian, S.; Chen, H.; Mercaldo, F.; Santone, A.; et al. MamlFormer: Priori-experience guiding transformer network via manifold adversarial multi-modal learning for laryngeal histopathological grading. Inf. Fusion 2024, 108, 102333. [Google Scholar] [CrossRef]
Huang, P.; Luo, X. FDTs: A Feature Disentangled Transformer for Interpretable Squamous Cell Carcinoma Grading. IEEE/CAA J. Autom. Sin. 2024, 12, 1. Available online: https://www.ieee-jas.net/en/article/id/5070f13e-1ef7-4848-99f5-efaedf69792b (accessed on 19 November 2024).
Huang, P.; Xiao, H.; He, P.; Li, C.; Guo, X.; Tian, S.; Feng, P.; Chen, H.; Sun, Y.; Mercaldo, F.; et al. LA-ViT: A Network with Transformers Constrained by Learned-Parameter-Free Attention for Interpretable Grading in a New Laryngeal Histopathology Image Dataset. IEEE J. Biomed. Health Inform. 2024, 28, 3557–3570. [Google Scholar] [CrossRef] [PubMed]
Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45. [Google Scholar] [CrossRef]
Martinelli, F.; Mercaldo, F.; Petrillo, L.; Santone, A. Security Policy Generation and Verification through Large Language Models: A Proposal. In Proceedings of the Fourteenth ACM Conference on Data and Application Security and Privacy, Porto, Portugal, 19–21 June 2024; pp. 143–145. [Google Scholar]
Martinelli, F.; Mercaldo, F.; Petrillo, L.; Santone, A. A Method for AI-generated sentence detection through Large Language Models. Procedia Comput. Sci. 2024, 246, 4853–4862. [Google Scholar] [CrossRef]
Sun, X.; Li, X.; Li, J.; Wu, F.; Guo, S.; Zhang, T.; Wang, G. Text classification via large language models. arXiv 2023, arXiv:2305.08377. [Google Scholar]
Petrillo, L.; Martinelli, F.; Santone, A.; Mercaldo, F. Toward the Adoption of Explainable Pre-Trained Large Language Models for Classifying Human-Written and AI-Generated Sentences. Electronics 2024, 13, 4057. [Google Scholar] [CrossRef]
Kant, N.; Puri, R.; Yakovenko, N.; Catanzaro, B. Practical text classification with large pre-trained language models. arXiv 2018, arXiv:1812.01207. [Google Scholar]
Dekhtyar, A.; Fong, V. Re data challenge: Requirements identification with word2vec and tensorflow. In Proceedings of the 2017 IEEE 25th International Requirements Engineering Conference (RE), Lisbon, Portugal, 4–8 September 2017; pp. 484–489. [Google Scholar]
Kobilica, A.; Ayub, M.; Hassine, J. Automated identification of security requirements: A machine learning approach. In Proceedings of the 24th International Conference on Evaluation and Assessment in Software Engineering, Trondheim, Norway, 15–17 April 2020; pp. 475–480. [Google Scholar]
Varenov, V.; Gabdrahmanov, A. Security requirements classification into groups using nlp transformers. In Proceedings of the 2021 IEEE 29th International Requirements Engineering Conference Workshops (REW), Notre Dame, IN, USA, 20–24 September 2021; pp. 444–450. [Google Scholar]
Alhoshan, W.; Ferrari, A.; Zhao, L. Zero-shot learning for requirements classification: An exploratory study. Inf. Softw. Technol. 2023, 159, 107202. [Google Scholar] [CrossRef]
Knauss, E.; Houmb, S.; Schneider, K.; Islam, S.; Jürjens, J. Supporting requirements engineers in recognising security issues. In Proceedings of the Requirements Engineering: Foundation for Software Quality: 17th International Working Conference, REFSQ 2011, Essen, Germany, 28–30 March 2011; Proceedings 17. Springer: Berlin/Heidelberg, Germany, 2011; pp. 4–18. [Google Scholar]
Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
Black, S.; Gao, L.; Wang, P.; Leahy, C.; Biderman, S. Gpt-neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow; Zenodo: Geneva, Switzerland, 2021; p. 58. [Google Scholar]
Zhang, J.; Zhao, Y.; Saleh, M.; Liu, P.J. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. arXiv 2019, arXiv:1912.08777. [Google Scholar]
Mercaldo, F.; Zhou, X.; Huang, P.; Martinelli, F.; Santone, A. Machine learning for uterine cervix screening. In Proceedings of the 2022 IEEE 22nd International Conference on Bioinformatics and Bioengineering (BIBE), Taichung, Taiwan, 7–9 November 2022; pp. 71–74. [Google Scholar]
Zhou, X.; Tang, C.; Huang, P.; Tian, S.; Mercaldo, F.; Santone, A. ASI-DBNet: An adaptive sparse interactive resnet-vision transformer dual-branch network for the grading of brain cancer histopathological images. Interdiscip. Sci. Comput. Life Sci. 2023, 15, 15–31. [Google Scholar] [CrossRef] [PubMed]
He, H.; Yang, H.; Mercaldo, F.; Santone, A.; Huang, P. Isolation Forest-Voting Fusion-Multioutput: A stroke risk classification method based on the multidimensional output of abnormal sample detection. Comput. Methods Programs Biomed. 2024, 253, 108255. [Google Scholar] [CrossRef] [PubMed]
Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
Liu, Y. Roberta: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
Yang, Z. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv 2019, arXiv:1906.08237. [Google Scholar]
Wang, Y.; Yao, Q.; Kwok, J.T.; Ni, L.M. Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv. (CSUR) 2020, 53, 1–34. [Google Scholar] [CrossRef]
Tunstall, L.; Reimers, N.; Jo, U.E.S.; Bates, L.; Korat, D.; Wasserblat, M.; Pereg, O. Efficient few-shot learning without prompts. arXiv 2022, arXiv:2209.11055. [Google Scholar]
Adelani, D.I.; Masiak, M.; Azime, I.A.; Alabi, J.; Tonja, A.L.; Mwase, C.; Ogundepo, O.; Dossou, B.F.; Oladipo, A.; Nixdorf, D.; et al. Masakhanews: News topic classification for african languages. arXiv 2023, arXiv:2304.09972. [Google Scholar]
Pannerselvam, K.; Rajiakodi, S.; Thavareesan, S.; Thangasamy, S.; Ponnusamy, K. SetFit: A Robust Approach for Offensive Content Detection in Tamil-English Code-Mixed Conversations Using Sentence Transfer Fine-tuning. In Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages, Dublin, Ireland, 26 May 2022; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 35–42. [Google Scholar]

Figure 1. The workflow of the proposed method to classify security and non-security requirements that rely on transformers fine-tuning and Few-Shot learning with transfer learning models leveraging pre-trained models using data augmentation.

Figure 2. Average sentence length distribution about its category.

Figure 3. The most 20 frequent words in the dataset.

Figure 4. Number of security and non-security requirements in the dataset.

Figure 5. Size of the three different training datasets after the word augmentation, back translation, and paraphrasing process.

Figure 6. Confusion matrix for the models evaluated on the test set, providing insights into the model’s classification accuracy. (a) Confusion matrix obtained for BERT-base-uncased on the standard dataset. (b) Confusion matrix obtained for BERT-base-uncased on the contextual word augmentation dataset.

Figure 7. Heatmap showing the attention visualization of the last layer (12th layer) of the Roberta-base model fine-tuned in the original dataset.

Figure 8. Heatmap showing the attention visualization of the last layer (12th layer) of the BERT-base-uncased model fine-tuned in the dataset obtained by the contextual embedding word augmentation process.

Figure 9. Heatmap showing the attention visualization of the last layer (12th layer) of the BERT-base-uncased model fine-tuned in the dataset obtained by the back-translation augmentation process.

Figure 10. Heatmap for the first security requirement showing the attention visualization of the last layer (sixth layer) of the DistilBERT-base-uncased model, fine-tuned in the dataset obtained by the paraphrase augmentation process.

Figure 11. Heatmap for the second security requirement showing the attention visualization of the last layer (sixth layer) of the DistilBERT-base-uncased model, fine-tuned in the dataset obtained by the paraphrase augmentation process.

Figure 12. Heatmap for the third security requirement showing the attention visualization of the last layer (sixth layer) of the DistilBERT-base-uncased model, fine-tuned in the dataset obtained by the paraphrase augmentation process.

Figure 13. Heatmap for the first non-security requirement, showing the attention visualization of the last layer (sixth layer) of the DistilBERT-base-uncased model fine-tuned in the dataset obtained by the paraphrase augmentation.

Figure 14. Heatmap for the second non-security requirement, showing the attention visualization of the last layer (sixth layer) of the DistilBERT-base-uncased model fine-tuned in the dataset obtained by the paraphrase augmentation.

Figure 15. Heatmap for the third non-security requirement, showing the attention visualization of the last layer (sixth layer) of the DistilBERT-base-uncased model fine-tuned in the dataset obtained by the paraphrase augmentation.

Table 1. Text examples of labeled security and non-security requirements.

Text	Label
The GlobalPlatform API shall provides services to the application such as card verification and security services	sec
Send transactions to other processors in a standard format defined in agreement with the scheme provider	nonsec

Table 2. The dataset’s statistics show the number of security and non-security requirements for each set and the total number of elements.

Type	No. of Security Requirements	No. of Non-Security Requirements	Total
Train	166	331	497
Test	21	41	62
Validation	21	43	64

Table 3. Examples of an original security requirement and the generated sentences using word augmentation, back translation, and paraphrasing for the data augmentation.

Text	Type
The capacity of the authorized entities should depend on the security policies defined by the service providers, managing the CNG.	original
the capacity of an authorized customer might depend on the security policies designed by authorized service providers, managing resources.	word augmentation
Authorized companies must be able to stand up to service provider-defined safety measures, which monitor CNG.	back translation
accordingly, the capacity of the authorized entities should depend on the security policies defined by the service providers who manage CNG	paraphrasing

Table 4. Summary of the hyperparameters used to train the presented models. These parameters have been used for all the four versions of the dataset.

Model	Learning Rate	Weight Decay	Batch Size	Epochs
BERT-base-uncased	$2 e - 5$	0.01	8	10
DistilBERT-base-uncased	$2 e - 5$	0.01	8	10
RoBERTa-base	$2 e - 5$	0.01	8	10
xlnet-base-cased	$2 e - 5$	0.01	8	10

Table 5. Results of the fine-tuned models in terms of accuracy, precision, recall, and F1-score on the four datasets.

Model	Accuracy	Precision	Recall	F1-Score
Normal dataset
BERT-based-uncased	0.89	0.89	0.76	0.82
DistilBERT-base-uncased	0.89	0.89	0.86	0.82
RoBERTa-base	0.9	0.86	0.86	0.86
xlnet-base-uncased	0.9	0.89	0.81	0.85
Contextual word augmented dataset
BERT-based-uncased	0.94	0.9	0.9	0.9
DistilBERT-base-uncased	0.89	0.85	0.81	0.83
RoBERTa-base	0.84	0.92	0.57	0.71
xlnet-base-uncased	0.92	0.94	0.81	0.87
Back translation augmented dataset
BERT-based-uncased	0.92	0.9	0.96	0.88
DistilBERT-base-uncased	0.92	0.9	0.96	0.88
RoBERTa-base	0.84	0.87	0.62	0.71
xlnet-base-uncased	0.9	0.83	0.9	0.86
Paraphrasing augmented dataset
BERT-based-uncased	0.9	0.86	0.86	0.86
DistilBERT-base-uncased	0.94	0.9	0.9	0.9
RoBERTa-base	0.87	0.88	0.71	0.79
xlnet-base-uncased	0.89	0.82	0.86	0.84

Table 6. Results of the few-shot learning process regarding accuracy, precision, recall, and F1-score on the four datasets.

Model	Accuracy	Precision	Recall	F1-Score
Normal dataset
RoBERTa-base	0.92	0.94	0.81	0.97
Contextual word augmented dataset
BERT-based-uncased	0.94	0.9	0.9	0.9
Back translation augmented dataset
BERT-based-uncased	0.9	0.86	0.86	0.86
Paraphrasing augmented dataset
DistilBERT-base-uncased	0.9	0.89	0.81	0.85

Table 7. Security and non-security requirements randomly extracted from the dataset and used to perform explainability by the best model fine-tuned on the normal dataset ^⋄, contextual word augmented dataset ^†, back-translated augmented dataset *, and paraphrased dataset ^★.

Text	Model	True Label	Predicted Label
Security requirement
Applications should Perform internal security measures required by the Application Provider	RoBERTa-base ^⋄	sec	sec
	BERT-based-uncased ^†	sec	sec
	BERT-based-uncased *	sec	sec
	DistilBERT-base-uncased ^★	sec	sec
The Card Issuer is responsible for Generating and loading the Issuer Security Domain keys	DistilBERT-base-uncased ^★	sec	sec
Applications should expose only data and resources that are necessary for proper application functionality	DistilBERT-base-uncased ^★	sec	sec
Non-Security requirement
Standard NGN authentication routing and connectivity should be used to establish remote access connectivity	DistilBERT-base-uncased ^★	non sec	non sec
The NGN network QoS functions shall be possible to use to enhance the end-user experience	DistilBERT-base-uncased ^★	non sec	non sec
CPN should support storage of UGC within the CPN	DistilBERT-base-uncased ^★	non sec	non sec

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Petrillo, L.; Martinelli, F.; Santone, A.; Mercaldo, F. Explainable Security Requirements Classification Through Transformer Models. Future Internet 2025, 17, 15. https://doi.org/10.3390/fi17010015

AMA Style

Petrillo L, Martinelli F, Santone A, Mercaldo F. Explainable Security Requirements Classification Through Transformer Models. Future Internet. 2025; 17(1):15. https://doi.org/10.3390/fi17010015

Chicago/Turabian Style

Petrillo, Luca, Fabio Martinelli, Antonella Santone, and Francesco Mercaldo. 2025. "Explainable Security Requirements Classification Through Transformer Models" Future Internet 17, no. 1: 15. https://doi.org/10.3390/fi17010015

APA Style

Petrillo, L., Martinelli, F., Santone, A., & Mercaldo, F. (2025). Explainable Security Requirements Classification Through Transformer Models. Future Internet, 17(1), 15. https://doi.org/10.3390/fi17010015

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Explainable Security Requirements Classification Through Transformer Models

Abstract

1. Introduction

2. Related Work

2.1. LLMs for Text Classification

2.2. Automatic Requirements Classification

3. The Method

3.1. Dataset

3.2. Data Augmentation

3.3. Models

3.3.1. Fine-Tuning

3.3.2. Few-Shot Learning

3.4. Explainability

4. Results

4.1. Fine-Tuning Results

Impact of Dataset Augmentation on Model Performance

4.2. Few-Shot Learning Results

4.3. Explainability Results

5. Discussion

Limitations

6. Conclusions and Future Work

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI